Distributed Priority Algorithms Under One-Bit-Delay Constraint


Reuven Cohen*
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA

Adrian Segall
Dept. of Computer Science, Technion, Haifa 32000, Israel


Abstract 

The paper deals with the issue of station delay in token-ring networks. It explains why one-bit-delay is the minimum possible delay at every station and shows that the station delay depends on the distributed computations performed in the ring. Then, the paper introduces the distributed priority mechanism for token-rings, as approved by the IEEE-802.5 standard. This mechanism attaches a priority field P and a reservation field R to the token, which circulates around the ring and controls the access to the shared medium. These two fields work together in an attempt to match the service priority of the ring to the highest priority message that is waiting to be sent.

It is shown that, due to the computation restrictions imposed by the one-bit-delay requirements, this priority mechanism has a grave deficiency, as follows. When the token priority is higher than the maximum reservation (P > R), the token may have to make up to P round-trips, where P is the number of priority levels, before P is reduced to R. During this time period, no station may seize the token and send a message. This leads to loss of bandwidth.

* This work was conducted in part when this author was with the Dept. of Computer Science, Technion, IIT.

The  paper  presents  a  new  priority  mechanism 
that  retains  the  desired  properties  of  the  standard. 
However,  in  the  new  protocol  when  P  >  R  holds, 
P  is  reduced  to  R  in  at  most  1  round-trip  rather 
than  in  up  to  P  round-trips. 

1 One-Bit-Delay in Ring Networks

In a token-ring¹ Local Area Network (LAN), the stations are located on a directed ring and each station transmits to its downstream neighbor. A short control message, called a token, circulates around the ring and regulates the access to the shared medium.

Each station in the ring can be in REPEAT or TRANSMIT mode. A station in REPEAT mode transmits to its downstream neighbor almost the same bit stream it receives from upstream. However, every received message contains a header with several bits that can be changed by stations in REPEAT mode. These bits will be called control bits.

A station in REPEAT mode that has a message to send waits to receive the token. When the token is received, the station does not transmit it to its downstream neighbor. By this action, the station seizes the token and gets transmission rights, so it can send its waiting message into the ring. The message circulates around the ring and is copied by its destination (which does not remove it). When the sender receives the message back after one ring revolution, it removes the message from the ring, issues a new token and returns to REPEAT mode. Consequently, the next station to have the opportunity of gaining transmission rights will be its downstream neighbor.

¹ In this paper we consider the 'traditional' token-ring networks, which use a moderate transmission rate (up to 16 Mb/s). In high-speed token-rings (like the 100 Mb/s FDDI), one-bit-delay is neither possible nor necessary.

This access control scheme implies that at any given time at most one station can be in TRANSMIT mode. While some station is in TRANSMIT mode, all the other stations are in REPEAT mode, repeating the message sent by the former. This ensures that every sent message will be received by its destination, complete a round-trip and return to its sender.

Consider a station in REPEAT mode. Let $B_1^-$ be the first bit the station receives after entering this mode, $B_2^-$ be the second incoming bit and so forth. Similarly, let $B_1^+$ be the first bit the station transmits after entering REPEAT mode, $B_2^+$ be the second bit and so forth. Recall that for every $i$, if $B_i$ is not a control bit, then $B_i$ is repeated (i.e., $B_i^+ \leftarrow B_i^-$) and $B_i^+ = B_i^-$ holds. However, if $B_i$ is a control bit then $B_i^+$ can be different from $B_i^-$.

The station delay is defined as the interval between the time the station starts receiving a bit and the time it starts transmitting the same bit, either with or without a change. It is convenient to express the station delay in terms of one bit time; namely, the time required to transmit one bit (1/T, where T is the transmission rate). A station that starts transmitting $B_i^+$ at the time when it starts receiving $B_i^-$ introduces no delay. On the other hand, a station that starts transmitting $B_i^+$ at the time when it starts receiving $B_{i+j}^-$ works with a delay of $j$ bits.

When the transmission rate in the ring does not exceed several Mb/s, the delay introduced by the stations is the main component of the round-trip delay. For instance, consider a 4 Mb/s token-ring LAN with 100 stations. Suppose that the average distance between every two neighbors is 10 m. Assuming a propagation speed of $2 \cdot 10^8$ m/sec, such a ring would have a round-trip delay of

$$(100 \cdot 10) \cdot \frac{4 \cdot 10^6}{2 \cdot 10^8} + 100 \cdot D = 20 + 100 \cdot D \ \text{bits}$$

where D is the delay at each station, in bits. This means that the propagation delay is only 20 bits, and that the delay in a ring whose stations have j-bit delay is almost j times the delay of a ring whose stations work with one-bit-delay.
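
The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name and its parameters are not from the paper, and the figures are the ones used in the example (100 stations, 10 m spacing, 4 Mb/s, $2 \cdot 10^8$ m/sec).

```python
def round_trip_delay_bits(n_stations, spacing_m, rate_bps, speed_mps, station_delay_bits):
    """Round-trip delay of the ring, expressed in bit times."""
    ring_length_m = n_stations * spacing_m
    # Propagation time (length / speed) converted to bit times (multiplied by the rate).
    propagation_bits = ring_length_m * rate_bps / speed_mps
    return propagation_bits + n_stations * station_delay_bits


one_bit = round_trip_delay_bits(100, 10, 4e6, 2e8, station_delay_bits=1)
four_bit = round_trip_delay_bits(100, 10, 4e6, 2e8, station_delay_bits=4)
print(one_bit, four_bit, four_bit / one_bit)
```

With these figures, a ring of one-bit-delay stations has a round-trip delay of 120 bits and a ring of 4-bit-delay stations has 420 bits, a ratio of 3.5, which illustrates the "almost j times" remark.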

Minimizing the round-trip delay is an important objective in token-rings, since in such networks the throughput increases as the delay decreases [1]. Moreover, a token-ring whose round-trip delay is small can support the transmission of real-time data.

A token-ring station cannot work with zero delay. Zero delay can be achieved only in a 'dead' station, which is not expected to change the control bits. Such a station is short-circuited and, therefore, it is bypassed by the signal. An operational station, however, is never short-circuited since it is expected to change some of the received bits. Therefore, in order to repeat a bit, an operational (i.e., not short-circuited) station must completely receive the bit and then transmit it. One may consider a '0-delay scheme' according to which a station is short-circuited before receiving a bit that should be repeated and is reconnected before receiving a bit that should be changed. However, this would not work since switching to and from short-circuit mode takes time and results in the loss of many bits.

The conclusion is that the delay at every operational station is at least one bit. However, in order to work with one-bit-delay, a station must be able to start transmitting every control bit, whose outgoing value is not necessarily equal to its incoming value, as soon as it completes receiving the corresponding bit. This implies that when some control bit is not repeated, its outgoing value must be determined independently of its incoming value and of the value of subsequent incoming bits.

For example, suppose that $B_i$ is a control bit. Suppose also that the station starts receiving this bit at $t$, thus completes receiving it at $t + 1$. If the value of $B_i^+$ is a function of the value of $B_i^-$, the station cannot start computing this function before $t + 1$, when $B_i^-$ is known. Therefore, the station cannot start transmitting $B_i^+$ at $t + 1$, but only after the computation is completed. If the outgoing value of a bit depends on the incoming value of subsequent bits, the station delay increases accordingly.

Conclusion: When the ring stations work with one-bit-delay, every outgoing bit $B_i^+$ is either a repetition of the corresponding incoming bit (i.e., $B_i^+ \leftarrow B_i^-$) or can be expressed as a function $F$ that is not dependent on the values of $B_i^-, B_{i+1}^-, \ldots$

Note that the decision of a station that works with one-bit-delay as to whether to repeat $B_i$, or to compute and transmit this bit according to some function $F$ that is not dependent on $B_i^-, B_{i+1}^-, \ldots$, can be made according to the value of $B_{i-1}^-, B_{i-2}^-, \ldots$. For instance, consider the problem of calculating and transmitting the maximum of an incoming N-bit string, which represents a binary number, and a local one. Let $(B_1^-, B_2^-, \ldots, B_N^-)$ be the incoming string, where the first bit is the most significant one, and $C = (C_1, C_2, \ldots, C_N)$ be the stored number. For example, for $N = 3$, if the input string is 011 and C is 100, the outgoing string should be max(011, 100) = 100, whereas if C = 001, the outgoing string should be 011. The following one-bit-delay algorithm calculates and transmits the maximum:
      i ← 1
   *  if C_i = 1 then B_i^+ ← 1 else B_i^+ ← B_i^-
      if B_i^+ ≠ B_i^- then (B_{i+1}^+, ..., B_N^+) ← (C_{i+1}, ..., C_N)
      else if B_i^+ ≠ C_i then (B_{i+1}^+, ..., B_N^+) ← (B_{i+1}^-, ..., B_N^-)
      else i ← i + 1; if i ≤ N go to *

The minimum of an incoming string and a local one can be computed and transmitted similarly.
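
As an illustration, the bit-serial maximum can be sketched in Python. This is a minimal sketch rather than the authors' implementation: the generator `one_bit_delay_max`, the helper `max_bits` and the three mode names are hypothetical. The point it shows is that each outgoing bit is produced right after the matching incoming bit is read, and that its value never depends on later incoming bits.

```python
def one_bit_delay_max(incoming, local):
    """Yield max(incoming, local) bit by bit, most significant bit first.

    `incoming` is an iterator over received bits, `local` a list of stored bits.
    """
    mode = "compare"  # the prefixes seen so far are equal
    for c, b_in in zip(local, incoming):
        if mode == "local":        # local number already known to be larger
            yield c
        elif mode == "repeat":     # incoming number already known to be larger
            yield b_in
        else:
            # When c == 1 the outgoing bit is 1 regardless of b_in;
            # otherwise the incoming bit is simply repeated.
            b_out = 1 if c == 1 else b_in
            yield b_out
            if b_out != b_in:      # c = 1, b_in = 0: send the rest of C
                mode = "local"
            elif b_out != c:       # c = 0, b_in = 1: repeat the rest of the input
                mode = "repeat"


def max_bits(incoming_str, local_str):
    """Convenience wrapper working on strings such as "011"."""
    incoming = iter(int(ch) for ch in incoming_str)
    local = [int(ch) for ch in local_str]
    return "".join(str(b) for b in one_bit_delay_max(incoming, local))


print(max_bits("011", "100"))  # the paper's first example: 100
print(max_bits("011", "001"))  # the paper's second example: 011
```

The two calls reproduce the N = 3 examples given in the text.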

As explained so far, a lower bound on the station delay is imposed by the rules that determine the outgoing values of the control bits. Since these rules are part of the Medium Access Control (MAC) protocol, it can be said that the station delay depends on the access control protocol executed in the ring.

In the simplest version of the token-ring access control protocol, all messages have the same priority. This implies that a station with a waiting message can seize the token upon receiving it and can transmit its message into the ring under no further restriction. Since the token is released by the same station after the message completes a ring revolution, this simple mechanism ensures fairness in the sense that the transmission rights are passed from each station to its downstream neighbor. References [2] and [5] deal with one-bit-delay implementation of this protocol. However, in order to support multiple services with different time requirements, like real-time voice samples, interactive data, file transfer and so forth, most Local Area Networks use some form of priority mechanism [6].

The rest of the paper deals with the design of distributed priority mechanisms in token-rings under the restriction of one-bit-delay. Section 2 presents the priority mechanism for token-rings as approved by the IEEE-802.5 standard. It shows that this mechanism was designed such that the ring stations would be able to work with one-bit-delay. However, due to the computation restrictions imposed by the one-bit-delay requirements, this priority mechanism may result in loss of bandwidth and starvation.

Section 3 presents an alternative distributed scheme. The new mechanism retains the fairness and liveness properties but it eliminates the deficiencies associated with the standard protocol. The new protocol has been designed for one-bit-delay operation as well. However, it uses several properties of the protocol variables to ensure that the outgoing value of some bit $B_i$ is the correct function of the received value of a subsequent bit $B_j$, where $j > i$, without waiting to receive $B_j$. The formal specification of the new protocol is presented in Section 4 and its main properties are summarized in Section 5. Section 6 concludes the paper.

2 Priority Mechanism in Token-Rings

The distributed priority mechanism for token-rings, as approved by the IEEE-802.5 standard, can be summarized as follows [4].

1) The token (also called token-frame) consists of three fields,² as shown in Figure 1(a): the Priority field P (3 bits), the Token bit T which is always 0, and the Reservation field R. A station that seizes the token changes it into a data-frame by setting the token bit T to 1 and appending Destination, Source and Data fields. The first two fields contain the identity of the destination and the sender, respectively. The Data field contains the message to be sent. Figure 1(b) shows the structure of a data-frame.

² In fact there is an additional one-bit field, called M, which is required for recovery. However, it plays no role in the priority mechanism, and therefore it is disregarded. Some irrelevant fields in the data-frame, such as the CRC, are omitted as well.

2) Each message is associated with a priority. The most urgent messages have priority 7, whereas the least urgent messages have priority 0. The priority range can be extended by increasing the appropriate fields (P and R).

3) A station wishing to send a message with priority $P_m$ must wait for a token (T = 0) with $P \le P_m$. When such a token is received, the station converts it into a data-frame (by setting $T \leftarrow 1$ and appending Source, Destination and Data fields) and resets the reservation field R to 0. In addition, the station changes its mode from REPEAT to TRANSMIT.

4) While waiting for a usable token (i.e., a frame for which $T^- = 0$ and $P \le P_m$ holds), a station may reserve a future token with the required priority $P_m$ by setting the R field to $P_m$ in every token- and data-frame it receives, provided that the received R is not larger than $P_m$. This means that such a station should perform $R^+ \leftarrow \max(R^-, P_m)$. As shown before, this can be done with one-bit-delay.

5) A station in TRANSMIT mode waits to receive the frame with its message back, after the latter completes a ring revolution. When the data-frame is received back, the station issues a new token by changing T back to 0 and removing the Destination, Source and Data fields. The priority field P of the new token is set to the maximum of $R^-$ (the reservation field in the received data-frame), $P^-$ (the priority field in the received data-frame) and $P_m$ (the priority of the most urgent message the station needs to send). This implies that when a data-frame is changed into a token-frame, $P^+ \ge P^-$ holds.

Figure 1: (a) a Token-Frame and (b) a Data-Frame in IEEE-802.5 Token-Ring

6) A station that increases the service priority of the ring, while changing a data-frame containing its own message to a token-frame, is responsible for returning the priority to its former level later. This prevents livelock situations, where the token circulates the ring indefinitely just because its priority level P is higher than the priority of any waiting message. As explained later, this is also the key means for ensuring fairness among stations.

In order to be able to restore the old value of P, a station that increases P stores the old and new values in local variables $S_r$ and $S_x$, respectively. Later, when the station detects a token with priority $P = S_x$, it decreases the priority, unless it has a waiting message with $P_m \ge P$. The new value $P^+$ is the maximum of the priority of the token before the station had increased it (as stored in $S_r$), the maximum reservation $R^-$, and the priority of the most urgent local message $P_m$. If a station increases the priority from $p$ to $p'$, say, and later decreases it to an intermediate value $p''$, it replaces the value $p'$ in $S_x$ with $p''$. This will enable the station to later decrease the priority from $p''$ to $p$ or again to an intermediate value.

Since a station may increase the priority level more than once before decreasing it, $S_x$ and $S_r$ should be stacks. Values are pushed into the stacks every time the station increases P and are popped when the priority is decreased to the old level. When a station decreases the priority level to an intermediate value, the latter replaces the top of $S_x$, while $S_r$ is unchanged.

This  approach,  where  only  a  station  that  had 
increased  the  priority  from  p'  to  p"  can  decrease 
it  from  p"  to  p',  in  one  or  more  steps,  is  the  basic 
means  for  achieving  fairness.  It  ensures  that  if  the 
priority  increases  and  then  drops  to  its  previous 
value  p,  the  first  station  to  benefit  from  the  new 
level  is  the  downstream  neighbor  of  the  station  that 
has  increased  P. 

A station that increases the token priority is called a stacking station. It remains in this category as long as its stacks are not empty.
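
The stack bookkeeping described above can be sketched as follows. This is a simplified model under stated assumptions, not the IEEE-802.5 state machine: the class `StackingStation` and its method names are hypothetical, frames are reduced to the three values P, R and $P_m$, a station with no waiting message is modelled with `p_m = -1`, and the one-bit-delay timing issues discussed next are ignored.

```python
class StackingStation:
    """Tracks the priority increases a station is responsible for undoing."""

    def __init__(self):
        self.s_r = []  # old priority levels (value of P before each increase)
        self.s_x = []  # levels the station raised P to

    def raise_priority(self, old_p, new_p):
        """Record that this station increased the token priority."""
        assert new_p > old_p
        self.s_r.append(old_p)
        self.s_x.append(new_p)

    def is_stacking(self):
        return bool(self.s_x)

    def on_token(self, p, r, p_m=-1):
        """Return the outgoing priority for a token with priority p and reservation r."""
        # Only the station whose top-of-stack matches P may lower it, and only
        # if it has no waiting message that could use the token as it is.
        if not self.s_x or p != self.s_x[-1] or p_m >= p:
            return p
        new_p = max(self.s_r[-1], r, p_m)
        if new_p >= p:
            return p                  # nothing to lower
        if new_p == self.s_r[-1]:     # back to the old level: pop both stacks
            self.s_r.pop()
            self.s_x.pop()
        else:                         # intermediate level: replace the top of S_x
            self.s_x[-1] = new_p
        return new_p


station = StackingStation()
station.raise_priority(2, 6)
first = station.on_token(p=6, r=4)   # lowered to the intermediate level 4
second = station.on_token(p=4, r=0)  # lowered to the old level 2, stacks emptied
print(first, second, station.is_stacking())
```

In this run the priority goes from 6 to 4 and then to 2, after which the station is no longer a stacking station.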

Almost all the above operations can be implemented with one-bit-delay. The only difficulty is at stacking stations. Recall that whenever it receives a frame, such a station is expected to do the following operation:

   "if ($T^- = 0$) ∧ ($P^- = S_x$) ∧ ($P^- > P_m$) then $P^+ \leftarrow \max(R^-, S_r, P_m)$"

Since T and R come after P, this operation cannot be done with one-bit-delay. The IEEE-802.5 standard copes with this difficulty in the following way. A stacking station that receives a token- or a data-frame repeats P with no change ($P^+ \leftarrow P^-$). After knowing $P^-$, the station checks whether $P_m \ge P^-$ holds, in which case the station sends the message provided that $T^- = 0$.

However, if $P_m < P^-$, the station compares $P^-$ with the top of its $S_x$ stack. If the two values are equal, the station must decrease P, provided that $T^- = 0$. Therefore, immediately when $T^-$ is completely received, the station transmits $T^+ = 1$ and then tests the received value of T. As explained in Section 1, this is a one-bit-delay operation although T is not repeated, because $T^+$ is determined independently of $T^-$ and subsequent bits.

If $T^-$ happens to be 1, namely the station is not supposed to alter P, it transmits the rest of the incoming frame with no change. In such a case the station works with one-bit-delay and performs exactly what it is expected to: repeat P and T with no change.

On the other hand, if the value of $T^-$ is 0 and $P^- = S_x$, the station has made two 'mistakes': it has transmitted P with no change, rather than setting it to the maximum of $R^-$ and $P_m$, and it has changed T to 1 rather than repeating it with no change. These two mistakes are corrected in the following way. The station generates a new token-frame with the appropriate priority field. Then it waits to receive the first 'frame', which is neither a token-frame nor a data-frame, and removes it from the ring.

The above procedure enables a stacking station to operate with one-bit-delay as long as it should not change P, namely when (1) the station receives a token, but the top value of the $S_x$ stack and the priority of the token are different ($S_x \ne P^-$), or (2) the station receives a token, but it has a waiting message with a priority equal to or greater than the priority of the token ($P_m \ge P^-$), or (3) the station receives a data-frame ($T^- = 1$).

On the other hand, when the stacking station should decrease the priority, it not only introduces higher delay, but also temporarily generates a second frame. As explained in [2], in many cases frame multiplication, even temporary, is unacceptable or at least undesirable.

However, the main deficiency of the standard protocol is that when P > R the token may make up to (P − R) round-trips before P is reduced to R. Consider a token with priority 7 and suppose that the value in the reservation field is 0, indicating that P should drop to 0. Suppose also that there are seven stacking stations, each of which has increased the token priority by one level. In such a case, the priority will be decreased in 7 steps as follows. In the first step, the stacking station that has upgraded the priority from 6 to 7 decreases it to 6. In the second step, the priority is decreased from 6 to 5, and so forth. The number of round-trips required to achieve P = 0 depends on the relative location of the stacking stations. In the worst case (where the station that decreases P from 7 to 6 is the downstream neighbor of the station that decreases P from 6 to 5, and the latter is the downstream neighbor of the station that decreases P from 5 to 4, and so forth), almost 7 round-trips are required.
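
The worst case can be made concrete with a small hop count. The sketch below is only illustrative and rests on simplifying assumptions that are not in the paper: stations are numbered 0 to N−1 in the downstream direction, the reservation stays at 0, no new messages arrive, and each stacking station lowers P by exactly one level as soon as the token reaches it. The function name `hops_until_lowered` and the two placements are hypothetical.

```python
def hops_until_lowered(n_stations, lowering_station, start, p):
    """Count station-to-station hops until the token priority reaches 0.

    lowering_station[level] is the ring position of the stacking station that
    lowers P from `level` to `level - 1`; `start` is the token's position.
    """
    pos, hops = start, 0
    while p > 0:
        target = lowering_station[p]
        hops += (target - pos) % n_stations  # travel downstream to that station
        pos = target
        p -= 1
    return hops


N = 100
# Worst case: the station lowering 7 -> 6 is the downstream neighbor of the
# station lowering 6 -> 5, and so on; the token has just passed the former.
worst = {level: level for level in range(1, 8)}
# Favourable case: the stacking stations are met in exactly the right order.
best = {level: 8 - level for level in range(1, 8)}

worst_hops = hops_until_lowered(N, worst, start=8, p=7)
best_hops = hops_until_lowered(N, best, start=0, p=7)
print(worst_hops / N, best_hops / N)  # round-trips in each case
```

With 100 stations the unfavourable placement costs 693 hops, i.e. 6.93 round-trips, whereas the favourable one costs only 7 hops.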

In those cases where several round-trips are required to reduce P to R, no station may seize the token and send messages since P is too high. Therefore, the ring bandwidth during this time interval is lost. Moreover, while the priority crawls to the desired level, new urgent messages may become ready for transmission. These messages may cause the priority field to increase before the least urgent messages are served. Since such a condition may repeat, the messages with lower priority might have to wait forever.

It should be clear that the only reason for the gradual decrease of P to R is to enable the stacking stations to operate most of the time with one-bit-delay, as shown above. If this restriction were eliminated, namely if stacking stations were allowed to receive and interpret the incoming values of T and R before determining the outgoing value of P, it would be possible to reduce P to R in at most 1 round-trip while preserving the fairness and liveness properties.

3  The  New  Mechanism 

The new mechanism is a modification of the IEEE-802.5 one. While it can be performed with one-bit-delay and it ensures fairness and liveness in the same sense the IEEE-802.5 protocol does, the new protocol reaches the required priority level considerably faster than the standard, without unnecessarily going through intermediate levels. The only penalty in the new protocol is an extra 3-bit field in the token- and data-frames, which is insignificant.

The basic elements of the new protocol are as follows. (1) The token- and data-frames contain a token bit T, a priority field P and two reservation fields, $R_1$ and $R_2$, as shown in Figure 2. (2) Each station manages a local vector of 8 entries, instead of two stacks. This vector plays the same role as the stacks in the original mechanism, namely it ensures fairness and liveness. (3) The rules for increasing the priority levels are the same as in the standard. The decreasing rules, however, are different.

In the new protocol, the maximal reservation is held in the first reservation field $R_1$ of a token (T = 0), and in the second reservation field $R_2$ of a data-frame (T = 1). The second reservation field $R_2$ in a token-frame, and the first reservation field $R_1$ in a data-frame, play no role. When a station that wishes to send a message with priority $P_m$ receives a frame (either a token- or a data-frame), it increases $R_1$ to $P_m$, provided that $R_1^- < P_m$. Since when the station receives $R_1$ it does not know yet the value of the token bit T, the reservation in $R_1$ is only tentative. If the station recognizes later that $T^- = 1$, it makes another reservation in $R_2$.

After trying to increase $R_1$, a station that wishes to send a message receives the priority field P and tests whether its value is less than or equal to the priority of its waiting message $P_m$. If $P_m \ge P^-$ and the station does not change P as explained later (i.e., $P^- = P^+$) and $T^- = 0$, the station is allowed to send its message. Suppose that $P_m \ge P^-$ and $P^- = P^+$ hold. At the time when $T^-$ is completely received, the station starts transmitting $T^+ \leftarrow 1$ independently of $T^-$ (this is a one-bit-delay operation), and tests the received value of T. If $T^- = 1$, the station cannot send its message since the received frame is not a token. Thus, it only tries to make a reservation in $R_2$. However, if the received value of T is 0, the station resets $R_2$ to 0 and sends its message. Since $R_2$ is reset to 0 whenever a token-frame is converted into a data-frame (like R in the original protocol), when the latter is received back by the sender, it contains the maximal reservation made by the stations in the last round-trip. Note that $R_1$ cannot be used to indicate the required reservation, since from one-bit-delay considerations it cannot be reset to 0 when the station knows that the received frame is a usable token.

Figure 2: a Token-Frame in the New Scheme (for a Data-Frame, T = 1 holds and Destination, Source and Data fields are appended)

A station that seizes the token changes its mode from REPEAT to TRANSMIT. In TRANSMIT mode the station transmits its message in the Data field of the data-frame and waits for the frame to come back. Upon receiving the frame back, the station converts it from a data-frame into a token-frame by changing the token bit to 0 and stripping the Destination, Source and Data fields. In addition, if $P^- < \max(R_2^-, P_m)$ the station adapts the token priority to the highest reservation by setting $P^+ \leftarrow \max(R_2^-, P_m)$. If $P^- > \max(R_2^-, P_m)$, namely the highest reservation is less than the current priority, there is only one station that may decrease the priority to the highest reservation. This is the last station to have sent a message using a token with priority less than or equal to $\max(R_2^-, P_m)$. This rule ensures fairness.

Note that whether $P^- > \max(R_2^-, P_m)$ holds or not, the station performs $R_1^+ \leftarrow \max(R_2^-, P_m)$. This is because in a token-frame $R_1$, rather than $R_2$, indicates the highest reservation. This implies that a station in TRANSMIT mode does not operate with one-bit-delay. This fact, which holds for the original protocol as well, is of no consequence because a token-frame never encounters a station in TRANSMIT mode, and a data-frame encounters only one station in this mode, namely its sender.

In the new protocol, changes in the priority level are not remembered in stacks. Stacks are not convenient here since if, for example, a station raises the priority from 1 to 3 and afterwards from 5 to 6, we allow the station to decrease it in one step from 6 or 7 not only to 5, but also to 1 or 2. Therefore, we use here a vector V with 8 three-bit entries (assuming there are 8 priority levels). The vector V at station $i$ is denoted by $V_i$ and its $r$'th entry by $V_i[r]$, where $0 \le r \le 7$. At initialization, $V_i[r] \leftarrow 7$ for every $r$. Afterwards, when station $i$ increases P from $p$ to $p'$, it performs $V_i[r] \leftarrow r$ for every $p \le r < p'$ and $V_i[r] \leftarrow \min(p, V_i[r])$ for every $r < p$. For example, suppose that station $i$ increases P from 1 to 3 and then from 5 to 6. After the first increase $V_i = (1, 1, 2, 7, 7, 7, 7, 7)$, whereas after the second increase $V_i = (1, 1, 2, 5, 5, 5, 7, 7)$. As shown later, entry $V_i[r]$ indicates the value station $i$ should decrease P to, given that $R_1 = r$ and some additional conditions are satisfied.
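
The update rule for the vector can be written down directly. The following is a minimal sketch: the names `new_vector` and `record_increase` are hypothetical, and only the bookkeeping performed when P is increased is modelled, as described in the paragraph above.

```python
LEVELS = 8  # priority levels 0..7


def new_vector():
    """Initial vector: every entry is set to the highest level, 7."""
    return [LEVELS - 1] * LEVELS


def record_increase(v, p, p_new):
    """Update vector v in place when the station raises P from p to p_new."""
    assert 0 <= p < p_new < LEVELS
    for r in range(p):           # entries below the old level
        v[r] = min(p, v[r])
    for r in range(p, p_new):    # entries in the range [p, p_new - 1]
        v[r] = r
    return v


v = new_vector()
after_first = list(record_increase(v, 1, 3))   # increase from 1 to 3
after_second = list(record_increase(v, 5, 6))  # then from 5 to 6
print(after_first)
print(after_second)
```

The two printed vectors match the values (1, 1, 2, 7, 7, 7, 7, 7) and (1, 1, 2, 5, 5, 5, 7, 7) given in the example.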

The new priority mechanism can now be summarized as follows:

(a) A station can send a message with priority $P_m$ only when it receives a token, provided that it does not change the priority field of this token and $P \le P_m$ holds. A station that receives a token and alters P can only decrease it. Allowing such a station to seize the token and send its message immediately would result in an unfair scheme.

(b) A station makes a reservation for a token with priority level $P_m$ by setting $R_1^+ \leftarrow \max(R_1^-, P_m)$ in every received frame. If the station detects later that $T^- = 1$, namely the received frame is a data-frame, it makes a reservation in $R_2$ by setting $R_2^+ \leftarrow \max(R_2^-, P_m)$.

(c) A station that does not change P and recognizes that $P^- \le P_m$ and $T^- = 0$ considers the received frame as a usable token. The station may seize the token by replacing it with a data-frame containing its own message. This is done by changing T from 0 to 1, resetting $R_2$ to 0 and appending Destination, Source and Data fields.

(d) A station that receives its data-frame back after one round-trip replaces it with a new token. The value of the priority field P is determined as follows. If $P^- < \max(R_2^-, P_m)$, then $P^+ \leftarrow \max(R_2^-, P_m)$. If $P^- = \max(R_2^-, P_m)$, then P does not change ($P^+ \leftarrow P^-$). If $P^- > \max(R_2^-, P_m)$, then P is decreased only if the conditions specified in (e) below are satisfied. In addition, the station sets $R_1^+ \leftarrow \max(R_2^-, P_m)$.

(e)  A  station  that  increases  A  from  p  to  p'  is  re¬ 
sponsible  for  decreasing  it  later  to  any  value  in  the 
range  \p  ,  p'  -  1].  Such  a  change  can  be  performed 
only  in  a  token  (T  =  0)  with  Ai  <  A. 

In order to illustrate rule (e) and the operation of the vector V that replaces the stacks of the IEEE-802.5 standard, suppose that station i increases P from 0 to 3, and then station j increases it from 3 to 7. As explained before, after this happens, V_i = (0, 1, 2, 7, 7, 7, 7, 7) and V_j = (3, 3, 3, 3, 4, 5, 6, 7). Suppose that a token with 0 ≤ R_1 ≤ 2 and P > R_1 circulates around the ring. In this case, one of the following scenarios may take place:

• if the token is received by j first, j reduces P to 3, since V_j[R_1] = 3 holds for 0 ≤ R_1 ≤ 2; afterwards, station i reduces P to the desired value R_1, since V_i[R_1] = R_1 holds for 0 ≤ R_1 ≤ 2.

• if the token is received by i first, i reduces P directly to R_1, since V_i[R_1] = R_1 holds for 0 ≤ R_1 ≤ 2.

However, if R_1 < P but 3 ≤ R_1 ≤ 6, only j can decrease P (directly to the required value). This is because in such a case V_i[R_1] < P does not hold, but V_j[R_1] = R_1 < P holds.
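
The two scenarios can be replayed with a small Python sketch. It assumes that a station holding an unusable token forwards min(V[R_1], P) as the new priority, which is the REPEAT-mode rule the paper states for such tokens; the function name `decreased_priority` is hypothetical.

```python
def decreased_priority(V, r1, p):
    """Priority forwarded by a station with vector V for reservation r1."""
    return min(V[r1], p)


V_i = [0, 1, 2, 7, 7, 7, 7, 7]   # station i raised P from 0 to 3
V_j = [3, 3, 3, 3, 4, 5, 6, 7]   # station j raised P from 3 to 7
P = 7

# Reservation R_1 = 1: j first and then i, or i directly.
via_j = decreased_priority(V_j, 1, P)
then_i = decreased_priority(V_i, 1, via_j)
direct_i = decreased_priority(V_i, 1, P)

# Reservation R_1 = 5: only j is able to lower P.
i_only = decreased_priority(V_i, 5, P)
j_only = decreased_priority(V_j, 5, P)
```

With R_1 = 1 both orders end at priority 1, while with R_1 = 5 station i leaves P at 7 and station j lowers it to 5.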

In order to illustrate the differences between the new approach and the one of IEEE-802.5, consider the following scenario. Suppose the priority P is 0, and station i is the first to seize the token and transmit a data-frame with a message. Suppose that when station i receives the frame back, it recognizes that the maximal reservation (R in the IEEE-802.5 protocol, R_2 in the new protocol) is 2. Therefore, in both mechanisms i increases P from 0 to 2. When the station that has reserved priority 2, j say, receives the token and recognizes that P = 2, it seizes the token and transmits a data-frame with its own message. Suppose that when the frame is received back, station j finds that the next highest reservation is for priority 4. Therefore, it increases P from 2 to 4, and the next station that has made the reservation for 4, k say, seizes the token and transmits a data-frame. When this station receives the frame back, it finds that the next reservation is for priority 6. Thus, it releases the token with P = 6, and the station that has made the reservation for 6, l say, seizes the token and transmits a frame.

So far all events in the new protocol are similar to those in the old one. However, suppose that when station l receives the frame back, it finds that the maximal reservation is for priority 0, namely no station holds a message whose priority is larger than 0. In the IEEE-802.5 protocol, only station k, which was the last to have increased P to 6, can decrease the priority from 6. Then, j can decrease P from 4 to 2. Only when i recognizes that P = 2, T = 0 and R = 0 does it decrease P to 0. Depending on the location of i, j, k and l, this process may take up to 3 round-trips. In the general case, where the ring has 𝒫 priority levels, this process takes up to 𝒫 round-trips. In the new protocol, on the other hand, when station i receives the token for the first time (and recognizes that the maximal reservation is 0, but P > 0), it decreases P to V_i[R_1] = 0, whether or not the received value of P is 2. This is because i was the last station to have increased P from 0.

Next we describe the one-bit-delay algorithm for decreasing the priority of an unusable token, for which T^- = 0 and T^+ = 0 holds. At first glance, this seems to be an impossible task. This is because in order to know that a received frame is an unusable token, a station must completely receive P and T, but when T is received (or even when P, which is a 3-bit field, is completely received), it is too late to decide the value of P while working with one-bit-delay. Therefore, a stacking station must know before receiving P^- and T^- whether the received frame is an unusable token. This is the most interesting part of the new scheme.

A station is allowed to decrease the priority P to V[R_1], where V[R_1] < P, if and only if it receives an unusable token, namely when T^- = 0 and P^- > Pm holds. Note that this condition is mandatory in the IEEE-802.5 standard's protocol as well. The reason for requiring T^- = 0 is that when T^- = 1, the maximal reservation is unknown, since only part of the stations have had the opportunity to make a reservation. P^- > Pm is required because P^- ≤ Pm indicates that the current service priority is still needed or even should be increased.

Upon receiving a frame, and after having updated R_1 (R_1^+ ← max(R_1^-, Pm)), a station in REPEAT mode always sets P^+ to the minimum of V[R_1^+] and P^- (as shown in Section 1, this can be done with one-bit-delay). We now explain why this action does not change P when T^- = 1 or when T^- = 0 and P^- ≤ Pm.

First consider the case where T^- = 1 (the received frame is a data-frame). As proved in [3], T^- = 1 implies that (a) R_1^- ≥ P^-. Since R_1^+ ← max(R_1^-, Pm), (b) R_1^+ ≥ R_1^- holds too. From (a) and (b) it follows that (c) R_1^+ ≥ P^-. In addition, (d) V[r] ≥ r holds for every station at any time. From (c) and (d) it follows that V[R_1^+] ≥ P^-. Thus, the operation P^+ ← min(V[R_1^+], P^-) does not change P.

Now consider the case where T^- = 0 and P^- ≤ Pm (the received frame is a usable token). Since P^- ≤ Pm and since R_1^+ ← max(R_1^-, Pm), it follows that R_1^+ ≥ P^-. This is the same as (c) in the previous case, and the rest of the proof is identical.
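
The argument can also be checked exhaustively for the eight priority levels. The Python sketch below is not part of the paper; it simply enumerates every combination allowed by (c) and (d) and confirms that taking the minimum leaves P untouched. The function names are hypothetical.

```python
def new_priority(v_entry, p_in):
    """The REPEAT-mode operation: minimum of V[R_1^+] and P^-."""
    return min(v_entry, p_in)


def min_never_changes_p(levels=8):
    """Check all cases with R_1^+ >= P^- (c) and V[r] >= r (d)."""
    for r1_out in range(levels):
        for v_entry in range(r1_out, levels):   # (d): V[r1_out] >= r1_out
            for p_in in range(r1_out + 1):      # (c): r1_out >= p_in
                if new_priority(v_entry, p_in) != p_in:
                    return False
    return True


holds = min_never_changes_p()
```

The check relies only on the two inequalities, which mirrors the proof: V[R_1^+] ≥ R_1^+ ≥ P^-.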

4  Formal  Specification 

Table 1 presents the formal specification of the station algorithm. Recall that a station is usually in REPEAT mode, in which case it may seize a token or make a reservation. A station that seizes a token enters TRANSMIT mode and transmits its message. After transmitting the message, a station in TRANSMIT mode waits for its data-frame to come back. Then it converts it into a token-frame and returns to REPEAT mode.

Recall also that the superscript "−" denotes a value of a field in the received token or data-frame, whereas the superscript "+" denotes the value of a field in the transmitted frame. In TRANSMIT mode, R_2 has an intermediate value, max(R_2^-, Pm), which is determined after R_2^- is known, but is changed (to 0) when R_2^+ is transmitted. This value is kept in a local variable, called R_3.

• in REPEAT mode:

    R_1^+ ← max(R_1^-, Pm)
    P^+ ← min(V[R_1^+], P^-)
    if (P^+ ≤ Pm) ∧ (P^+ = P^-) ∧ (T^- = 0) then   (* the token is seized *)
        T^+ ← 1
        R_2^+ ← 0
        change mode to TRANSMIT
        transmit Destination, Source and Data fields
    else R_2^+ ← max(R_2^-, Pm)
    update the local vector V (as shown below)

• in TRANSMIT mode:

    R_3 ← max(R_2^-, Pm)
    T^+ ← 0; R_1^+ ← R_3; R_2^+ ← 0
    if R_3 > P^- then
        P^+ ← R_3
        ∀r ∈ [0, P^- − 1] do V[r] ← min(V[r], P^-)
        ∀r ∈ [P^-, P^+ − 1] do V[r] ← r
    else P^+ ← min(V[R_1^+], P^-)
    change mode to REPEAT and update the local vector V (as shown below)

• Updating the Local Vector V

    ∀r ∈ [P^+, 7] do V[r] ← 7
    r ← P^+ − 1
    while (r ≥ 0) ∧ (V[r] ≠ r) do
        V[r] ← 7
        r ← r − 1

Table 1: Station Algorithm Upon Receiving a Token- or a Data-Frame
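
A minimal Python sketch of the station algorithm may help in following the table. It is an illustration under stated assumptions rather than the authors' implementation: it processes whole frames instead of working bit by bit, the class and field names are hypothetical, the closed range [P^+, 7] and the loop test r ≥ 0 are one reading of the table, and the value -1 is an assumed encoding for "no waiting message".

```python
from dataclasses import dataclass, field

LEVELS = 8          # priority levels 0..7 (P is a 3-bit field)
TOP = LEVELS - 1
NO_MESSAGE = -1     # assumption: encodes "no waiting message" for Pm


@dataclass
class Frame:
    P: int   # priority field
    T: int   # 0 = token, 1 = data-frame
    R1: int  # first reservation field
    R2: int  # second reservation field


@dataclass
class Station:
    V: list = field(default_factory=lambda: [TOP] * LEVELS)
    mode: str = "REPEAT"

    def _update_vector(self, p_out):
        # Reset every entry at or above the transmitted priority.
        for r in range(p_out, LEVELS):
            self.V[r] = TOP
        # Below it, reset entries until one with V[r] == r is found.
        r = p_out - 1
        while r >= 0 and self.V[r] != r:
            self.V[r] = TOP
            r -= 1

    def on_frame(self, f, pm=NO_MESSAGE):
        """Handle a received frame f; pm is the waiting message priority."""
        if self.mode == "REPEAT":
            r1 = max(f.R1, pm)
            p = min(self.V[r1], f.P)
            if p <= pm and p == f.P and f.T == 0:
                out = Frame(P=p, T=1, R1=r1, R2=0)      # token is seized
                self.mode = "TRANSMIT"
            else:
                out = Frame(P=p, T=f.T, R1=r1, R2=max(f.R2, pm))
        else:  # TRANSMIT: our own data-frame has come back
            r3 = max(f.R2, pm)
            if r3 > f.P:
                p = r3
                for r in range(f.P):
                    self.V[r] = min(self.V[r], f.P)
                for r in range(f.P, p):
                    self.V[r] = r
            else:
                p = min(self.V[r3], f.P)
            out = Frame(P=p, T=0, R1=r3, R2=0)
            self.mode = "REPEAT"
        self._update_vector(p)
        return out
```

As a usage example, a station in TRANSMIT mode that gets its frame back with P = 1 and R2 = 3 releases a token with P = 3 and ends with V = (1, 1, 2, 7, 7, 7, 7, 7), matching the first increase in the example of Section 3.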

5  Main  Properties  of  the  New 
Mechanism 

The present section outlines the main properties of the new mechanism, as stated and proved in [3].

• The algorithm in REPEAT mode can be performed with one-bit-delay.

• The protocol ensures liveness in the following sense: (a) within at most 1 round-trip (instead of up to 𝒫 round-trips in the IEEE-802.5 standard) after a token is issued, either one of the stations seizes it and transmits its message, or the priority field P of the token contains the value of the maximal reservation; (b) the token cannot circulate the ring for more than 2 consecutive round-trips (instead of 𝒫 + 1 in the standard protocol), if some messages are waiting for transmission.

• The protocol ensures fairness in the following sense: (a) messages with higher priority are sent first; (b) if station i seizes a token and transmits a waiting message with priority p at the time when station j has a waiting message with the same priority, then j will get an opportunity to transmit its message before i gets another opportunity to transmit a message with priority p.

6  Conclusion 

The  paper  has  addressed  the  issue  of  station  de¬ 
lay  in  ring  networks.  It  has  explained  why  one- 
bit-delay  is  the  minimum  possible  delay  at  a  ring 
station  and  shown  that  the  station  delay  depends 
on  the  Medium  Access  Control  protocol  executed 
in  the  ring. 

Then,  the  paper  has  introduced  the  distributed 
priority  mechanism  for  token-rings  as  approved  by 
the  IEEE-802.5  standard.  We  have  shown  that 
due  to  the  computation  restrictions  imposed  by  the 
one-bit-delay  requirements,  this  mechanism  leads 
to  loss  of  bandwidth  and  starvation. 

The  paper  has  presented  a  new  priority  mecha¬ 
nism.  The  new  mechanism  retains  all  the  desired 
properties  of  the  standard:  it  can  be  executed  by 
the  ring  stations  with  one-bit-delay  and  it  ensures 
liveness and fairness. However, in the new protocol when P > R holds, P is reduced to R in at most 1 round-trip rather than up to 𝒫 round-trips. This minimizes the loss of bandwidth and enables low priority messages to be served much faster.

Connection-Based Communication in Dynamic Networks

Extended  Abstract 

Amir  Herzberg* 


Abstract 

We analyze and improve the fault tolerance of practical, efficient end to end communication schemes. We concentrate on connection-based source routing schemes, used in most existing wide-area networks, e.g. in SNA/APPN. These schemes are composed of three components: a topology update protocol, a route selection algorithm and a connection protocol. The topology update protocol maintains an approximation of the network topology at every processor. The route selection algorithm in the source processor uses the topology approximation to select the ‘best’ route to the destination. The connection protocol sends messages along this route. We make the following contributions:

• Explicitly expressing the high efficiency possible with connection-based schemes. This efficiency is well appreciated in practice, but has not been analyzed formally so far. Roughly speaking, both communication and time complexities are in the order of the shortest route from source to destination which is up for ‘long enough’.

• Understanding the tolerance of the scheme used in SNA/APPN. Roughly, this scheme operates efficiently if the mean time between failures (MTBF) of the network is at least the number of processors. The scheme may fail for lower MTBF even if a route from source to destination never fails.

•  A  new  route  selection  algorithm  which  ensures 
efficient  operation  even  when  failures  are  fre¬ 
quent  (low  MTBF),  as  long  as  some  route  from 
source  to  destination  is  up  for  sufficiently  long. 

•  Formalizing  the  notion  of  MTBF. 

•  A  crash-tolerant,  self-stabilized  connection  pro¬ 
tocol  with  bounded  storage  and  messages. 

1  Introduction 

The usual form of communication in networks is end to end, i.e. data from a source processor is transmitted to a destination processor. We consider this task in the setting of point-to-point wide area networks based on packet switching, e.g. SNA [BGJ+85]. We are interested in the resiliency to failures. Formally, we use the dynamic networks model of [AE86], i.e. an asynchronous message-passing network G(V, E) with general topology in which links fail and recover.

All practical end to end protocols are based on sending every packet along a single route from source to destination. The protocols differ in their strategies for selecting this route and ensuring FIFO and delivery. However, all protocols try to use a ‘short’ (i.e. efficient) route.

We concentrate on protocols which are based on connections (also called sessions or virtual circuits). In such protocols, the communication is done by ‘setting-up’ an ‘efficient’ route between the source and the destination, and only then communicating along this route. This is efficient if much data is communicated over the connection, since most routing information can be sent just once during set-up, and not with each piece of data. If only one message is sent, it is more efficient to use ‘connection-less’ schemes.

We further concentrate on source-routing schemes, in which the source processor selects the route to the destination based on some topology data. Connection-based communication with source routing is widely used in practice, for example in SNA [BGJ+85, Gro82] and in Codex [BG87].

Connection based source routing schemes are composed of three components: a topology update protocol, a route selection algorithm and a connection protocol. The topology update protocol maintains an approximation of the network topology at every processor. The route selection algorithm in the source processor uses the topology approximation to select the ‘best’ route to the destination. The connection protocol sets up the route and then efficiently sends data messages along the route.

Our  Contributions 

Our main contribution is an explicit analysis of the high efficiency possible with connection-based schemes. Loosely speaking, the communication complexity of our modification of the SNA/APPN scheme is in the order of the ‘shortest stable up’ route from the source to the destination. More precisely, the average delay and the communication cost per packet sent are O(l), where l is the length of the shortest O(|E| · l)-Up route from source to destination. An O(|E| · l)-Up route consists of links which are up for the last O(|E| · l) time units. This efficiency is well appreciated in practice, but has not been analysed formally so far.


Additional Contributions

•  Understanding  the  sensitivity  to  failures  of 
the  scheme  used  in  SNA/APPN.  SNA  is  the 
most  common  point  to  point  network,  and 
SNA/APPN  is  its  newer  version.  Hence, 
the  effects  of  failures  on  the  scheme  used  in 
SNA/APPN  are  of  great  practical  importance. 
However,  while  it  is  well  understood  that  in  prac¬ 
tice  SNA/APPN  operates  well  in  spite  of  some 
failures,  this  was  not  analyzed  so  far.  We  give 
precise  sufficient  and  necessary  conditions  for 
efficient  operation  of  the  SNA/APPN  routing 
scheme.  In  particular,  we  show  that  SNA/APPN 
requires  mean  time  between  failures  (MTBF)  of 
at  least  |V|,  even  if  some  route  from  source  to 
destination  never  fails. 

• A simple modification to the route selection algorithm, which allows the SNA/APPN scheme to operate efficiently whenever the source and the destination are connected by a route which is up for a period of length Ω(|E| · |V|). The modified scheme also operates if the MTBF is |V| or more.

• A self-stabilized, crash-tolerant connection protocol, which uses only bounded resources (without unbounded counters). The protocol is very simple and efficient. We present also a simpler version which is bounded and crash-tolerant (but not self stabilized). This version uses the same flows as the existing connection protocol of SNA/APPN [SJ86, BGJ+85], and therefore is very practical.

Connection protocols in use today are neither self-stabilizing nor crash-tolerant [SJ86, BGJ+85]. A self-stabilizing connection protocol was suggested in [Spi89, Spi90]. All of these protocols deal with the simple case where only one route is used between each source and destination. The only solution proposed to the more realistic case where many routes are used is unbounded counters or time-stamps [Gro82] (which is not self-stabilized).

A self-stabilized (crash-tolerant) end to end protocol is obtained by using our connection protocol together with a self-stabilized (crash-tolerant) topology update protocol, e.g. [Spi86, HS89]. Further work is required to analyse the complexities of the resulting scheme, however they seem to be inferior to the excellent complexity of the present scheme.

• A quantified definition of the MTBF of a link and a network. These concepts extend the quantitative approach as presented in [AGH90, GHS]. Our work gives additional evidence that this approach allows formal analysis with practical significance.

Related  Works 

Previous formal analysis of end to end protocols showed complexities which are a function of the size of the network, not of the length of the shortest ‘possible’ route from source to destination. Hence, these protocols appear much less efficient than the APPN protocol, especially with the improvement suggested in this work. We first compare this work to the
protocols  analyzed  using  the  quantitative  definition 
of  a  /()— Up  route  [AGH90,  GHS].  Both  of  these 
works  suggested  ‘connection-less’  end  to  end  proto¬ 
cols.  However,  these  protocols  are  much  less  efficient 
than  the  one  of  this  paper.  The  protocol  of  [GHS] 
has communication complexity of 0(|£^|), and furthermore requires that every two processors will be always connected by an O(|V|)-Up route. The
protocol  of  [AGH90]  has  communication  complexity 
0(|£^|)  when  amortized  over  prefixes  of  the  execu¬ 
tion.  Furthermore,  under  the  more  realistic  commu¬ 
nication  complexity  measure  of  [GHS]  (also  Def.  3  be¬ 
low),  which  allows  amortization  over  any  sufficiently 
long  interval,  the  complexity  of  the  end  to  end  in 
[AGH90]  is  exponential.  Note  that  the  broadcast  pro¬ 
tocol  which  is  the  main  result  of  [AGH90,  GHS]  is 
efficient  under  both  measures. 

We  now  compare  our  assumptions  to  the  two  as¬ 
sumptions  used  in  most  existing  formal  works  in 
this  area.  The  eventual  stability  assumption  [Gal76, 


Fin79,  Seg83,  AAG87]  is  that  after  some  finite  but 
unbounded  time,  there  will  be  no  more  failures.  All 
practical  protocols  work  under  this  assumption.  How¬ 
ever,  this  assumption  is  insufficient  to  ensure  any 
progress  concurrently  with  failures.  Hence,  the  com¬ 
plexity  measure  used  in  these  works  is  the  average 
complexity  per  failure,  which  does  not  capture  the  ac¬ 
tual  efficiency  of  connection-based  schemes.  In  real¬ 
ity,  failures  may  be  frequent  [RS91],  and  protocols  are 
designed  to  be  resilient  to  concurrent  failures.  The 
eventual  stability  assumption  does  not  enable  us  to 
measure  this  resilience  to  concurrent  failures. 

The other assumption used in many existing formal works is that the network is eventually connected. Loosely speaking, two processors are eventually connected if there is a route between them with links that ‘sometimes’ operate [AE86]. Obviously, this is an extremely weak requirement. Unfortunately, it is so weak that reasonable solutions appear incorrect, and unreasonable impossibilities and lower bounds hold. In particular, to ensure communication between eventually connected processors we must try every possible route. This gives communication cost of Ω(|E|).¹ If one allows unbounded counters, then an O(|E|) solution is by flooding [Vis83]. However, this is still too inefficient for practical use. Clearly, the much less efficient solutions without unbounded counters [AG88, AMS89, AG91, AGR92] are impractical.

2  Definitions 

2.1  The  Dynamic  Networks  Model 

We consider the dynamic network model of [AAG87, AE86]. The network is represented by an undirected graph G(V, E), with n = |V| processors and m = |E| links. Each link (u, v) enables direct communication between processor u and v; we say that u and v are neighbors. There is no assumption about the topology of the graph, and the topology is a priori not known to the processors. However, the processors know n, and each processor has a distinct identity.

¹ However, [AGR93] show that if the data transmitted is very long, cost of O(|V|) is possible!

The communication over each link satisfies the reliability properties as defined in [BS88]. The main property is that FIFO is preserved. Most properties deal with periods of time when the link is ‘up’. The link is up at a processor from a recovery until its next failure.

We assume total order between events. We also associate a positive number time(e) with each event e. The number time(e) represents the ‘normalized time’ of event e. Namely, the transmission delay is at most one time unit, regardless of the amount of current traffic. The use of the total order and of the time is only for the sake of analysis, and completely transparent to the protocol. In particular, the processors are not aware of the ‘time’ of events during the execution. Hence, the model is completely asynchronous.

2.2  Quantitative  and  Dynamic  Link 
Properties 

The main idea of the quantitative approach enunciated in [AGH90, GHS] is that for some purposes, it may be required that the link be up for some period continuously. This is easily expressed by the definition below of a link being r-Up, which essentially refines² definition 2.1 from [AGH90].

Definition 1 We say that link (u, v) is r-Up at time t, if (u, v) is up at both v and u during the entire interval [max(t − r, 0), t].
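
Definition 1 can be turned into a small analysis-side check in Python. The sketch assumes a representation that is not in the paper: the up periods of the link at each endpoint are given as disjoint closed intervals (recovery time, failure time), and the function name `is_r_up` is hypothetical. The check is available only to an outside observer with access to normalized time, not to the processors themselves.

```python
def is_r_up(up_at_u, up_at_v, r, t):
    """True if the link is up at both ends during [max(t - r, 0), t].

    up_at_u, up_at_v: lists of (begin, end) closed intervals during
    which the link is up at u and at v respectively (assumed disjoint).
    """
    start = max(t - r, 0)

    def covers(periods):
        # Some single up period must contain the whole window.
        return any(begin <= start and t <= end for begin, end in periods)

    return covers(up_at_u) and covers(up_at_v)


# Link up at u during [0, 10] and at v during [2, 10].
up_u = [(0, 10)]
up_v = [(2, 10)]
short_window = is_r_up(up_u, up_v, 5, 8)   # window [3, 8]
long_window = is_r_up(up_u, up_v, 7, 8)    # window [1, 8]
```

The first window lies inside both up periods, while the second starts before the link came up at v.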

Note that v knows, at any moment, if (u, v) is up at v. However, processor v does not know if (u, v) is currently r-Up. One reason is that the network is asynchronous and hence v cannot detect when r time units have elapsed. Another reason is that processor v does not know if (u, v) is up at u.

We now present a new definition which formalizes the mean time between failures, or MTBF, of a link or of the network at a given time period. It seems that many practical protocols are sensitive to the MTBF. Some networks even contain mechanisms to delay recoveries in order to ensure high MTBF [RS91].

² The term r-Up was suggested by David Peleg, instead of r-Reliable in [AGH90].

Definition  2  A  link  (u,  v)  has  MTBF  of  ft  during 
[fiifs]  if  the  number  of  failures  o/(u,v)  during  [fi.fs] 
is  at  most  .  Similarig,  the  network  has  MTBF  of 
ft  during  [fi,!]]  if  the  total  number  of  failures  during 
is  at  most 

2.3  Communication  Complexity 

We found it crucial to the understanding of practical protocols to use a definition of communication complexity that averages the communication over intervals with specific properties, for example long enough. This definition is taken from [GHS]; see there for more elaborate motivations.

The  communication  complexity  is  defined  with  re¬ 
spect  to  a  predicate  P  of  intervals.  Loosely  speak¬ 
ing,  C  is  the  communication  complexity  of  a  protocol 
with  respect  to  predicate  P,  if  in  every  execution,  the 
number  of  packets  received  by  the  protocol  in  every 
interval  which  satisfies  P  is  at  most  C. 

The  function  C  depends  on  the  size  of  the  network 
and  on  the  number  of  packets  accepted  for  transmis¬ 
sion  from  the  higher  layer  during  the  interval.  In 
dynamic  networks,  C  depends  also  on  the  number  of 
failures  and  recoveries  during  the  interval. 

Some or all of the packets sent and received during an interval may be due to packets accepted, failures and recoveries that have occurred before this interval. The communication complexity therefore considers also the events during a certain interval before the measured interval. To reduce notations, we require that both intervals are of the same length.

Definition 3 We say that function C is the communication complexity of the protocol over intervals satisfying P, if in every execution, the number of receive events during any interval (t, t + x] satisfying P is at most

C(n, m, A, F, R)

where:

A  The number of messages Accepted in the source during [t − x, t + x].

R  The number of Recover events during [t − x, t + x].

F  The number of Fail events during [t − x, t + x].

If C(n, m, A, F, R) = A · C_A(n, m) + F · C_F(n, m) then we say that the communication is C_A per accept and C_F per failure.

Restricting  the  length  of  packets.  The  defini¬ 
tion  above  counts  only  the  number  of  events,  without 
taking  into  account  the  size  of  the  packets  involved. 
To  justify  this,  we  assume  that  each  packet  may  con¬ 
tain  only  one  of  the  following:  a  processor  identity,  a 
O(logn)  length  field,  and  a  ‘counter’. 

2.4  Throughput 

Intuitively,  the  throughput  is  the  rate  at  which  the 
higher  layer  may  send  packets  using  the  protocol.  For 
simplicity,  we  assume  that  the  higher  layer  always  has 
packets  to  send.  This  definition  is  from  [GHS],  where 
it  is  given  without  this  assumption. 

Like  communication  complexity,  the  throughput  is 
traditionally  measured  for  the  worst  case  prefix  of 
an  execution  of  the  protocol.  As  for  communication 
complexity,  we  see  no  apparent  reason  to  consider 
only  prefixes  of  executions,  which  may  hide  impor¬ 
tant  transient  effects.  Instead,  we  propose  to  consider 
the  throughput  of  any  interval  of  an  execution  of  the 
protocol. 

Definition 4 We say that function T is the throughput of the protocol over intervals satisfying predicate P, if in every execution, the number of packets delivered at the destination during any interval (t, t + x] satisfying P is at least x · T(n, m).

3  Analyzing  and  improving 
the  SNA/APPN  Scheme 

Connection-based  schemes  are  composed  of  three 
components:  a  topology  broadcast  protocol,  a 


route  selection  strategy  and  a  connection  protocol. 
The  topology  broadcast  protocol  maintains  topology 
databases  in  the  processors,  trying  to  keep  them  as 
close  to  the  actual  topology  as  possible.  Whenever  a 
source  processor  wishes  to  communicate  with  a  des¬ 
tination  processor,  the  route  selection  strategy  in  the 
source  uses  its  topology  database  to  select  the  ‘best’ 
route  to  the  destination.  Then,  the  connection  pro¬ 
tocol  tries  to  send  messages  over  this  route  to  the 
destination. For efficient transmission of many messages over the route, the connection protocol sends the description of the route only in the first few control messages, used to set up the route. The route selection strategy may also pick a new route because of failure or otherwise. Then, the connection protocol takes down the old route and sets up the new route.

3.1  The  SNA/APPN  Scheme 

In  this  subsection,  we  investigate  the  communication 
scheme  used  in  SNA/APPN  [BGJ+85,  JBS86].  Like 
other  practical  schemes,  this  scheme  is  based  on  the 
use  of  ‘unbounded  counters’  to  identify  the  order  be¬ 
tween  packets.  This  greatly  simplifies  the  implemen¬ 
tation  of  the  topology  broadcast  and  connection  pro¬ 
tocols. 

The  topology  update  protocol  in  APPN  is  based 
on  flooding.  Each  topology  change,  once  detected,  is 
sent  to  all  neighbors  with  a  sequence  number  and  the 
identity  of  the  processor  that  detected  the  topology 
change.  Whenever  a  processor  receives  a  packet  de¬ 
scribing  some  topology  change  detected  by  another 
processor,  it  compares  the  sequence  number  in  this 
packet  to  the  last  sequence  number  it  received  from 
that  processor.  If  the  packet  received  contains  a 
smaller  or  equal  number  to  the  one  stored,  then  this 
packet  is  ignored  (being  old).  Otherwise,  the  packet 
is  sent  to  all  neighbors,  the  topology  estimate  is  up¬ 
dated  and  the  sequence  number  of  changes  from  that 
processor  is  updated. 
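The flooding rule above can be sketched in a few lines. This is only an illustration under simplifying assumptions: packets are delivered synchronously by direct method calls, and the names `Processor`, `detect_change`, `receive` and `connect` are hypothetical, not taken from APPN.

```python
from collections import defaultdict


class Processor:
    """One node running sequence-numbered flooding (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []               # directly connected Processor objects
        self.last_seq = defaultdict(int)  # origin -> highest sequence number seen
        self.topology = {}                # origin -> last link state reported by it
        self.own_seq = 0                  # unbounded counter for our own updates
        self.sent = 0                     # packets sent, kept for illustration

    def detect_change(self, link_state):
        """A local topology change was detected: number it and flood it."""
        self.own_seq += 1
        self.receive(self.name, self.own_seq, link_state)

    def receive(self, origin, seq, link_state):
        # A smaller or equal sequence number means the packet is old: ignore it.
        if seq <= self.last_seq[origin]:
            return
        # Otherwise update the stored number and the topology estimate ...
        self.last_seq[origin] = seq
        self.topology[origin] = link_state
        # ... and send the packet to all neighbors.
        for nb in self.neighbors:
            self.sent += 1
            nb.receive(origin, seq, link_state)


def connect(u, v):
    """Create a bidirectional link between two processors."""
    u.neighbors.append(v)
    v.neighbors.append(u)
```

On a triangle of three processors, one detected change makes every processor forward the packet once over each of its two links, i.e. two packets per link, which is in line with the O(m) cost per change stated below.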

This simple protocol ensures that the source will be
updated about the state of every processor connected
to it by a route which is up for ‘long enough’, as
formalized in the theorem below.

Theorem 1 There is a topology update protocol with
communication complexity O(m) per change for
intervals of length n and with the following property. If
processors u and v are connected by an (l + 2)-Up
route of length l at time t, then the state of the links
of v declared by the protocol in u at time t is the actual
state in v at some time during [t − (l + 2), t].
Furthermore, the number of topology changes in any link
declared by the protocol in u during [t − r, t] is at most
the number of actual changes during [t − (l + 2) − r, t].

Proof: See algorithm 2.3b of [JBS86]. □

The  estimate  produced  by  the  topology  update  pro¬ 
tocol  is  used  in  APPN  to  select  the  shortest  operating 
route  to  the  destination.  Whenever  a  shorter  route 
becomes  operating,  then  it  is  used  instead.^ 

The selected route is used by the connection protocol
to send the data messages. Each message is sent
in a packet, encapsulated together with control information
which allows the packet to be forwarded to
the destination. This control information should be
short in order to ensure efficient communication. This
is done by sending the description of the route only
with a set-up packet at the beginning of the use of
the route. When an intermediate processor receives
the set-up packet, it stores the identity of the next
neighbor toward the destination, and an identifier of
this connection. After this set-up process, each data
packet contains only this identifier. APPN uses the
set-up procedure of [SJ86], which allows a short identifier,
and requires 2l to complete, where l is the length
of the route.

Unbounded  counters  are  used  only  in  the  set-up 
process.  The  source  appends  a  connection  counter, 
which  is  used  to  discard  old  connections  in  favor  of 
newer  connections  [Gro82].  In  the  next  section  this 
unbounded  counter  is  eliminated. 

While sending data packets, APPN uses a window,
i.e. allows several data packets to be forwarded
along the route at the same time. This ensures high
throughput (O(1)) as long as the same route is used.
For simplicity, assume every hop (link) contains at
most one packet. Below we summarize the properties
of the connection protocol used in APPN.

^ In some actual implementations of APPN, a route is used
until it fails. This may cause the use of an inefficient route
until it fails. We analyse above the improvement of [Gro82],
where the shortest route is always used (if necessary, taking
down intentionally the operating route). The only advantage
we found for the other method is that it is robust to frequent
recoveries.

Theorem 2 There is a connection protocol which
ensures that messages are always delivered in the order
in which they were received (FIFO). Also, assume
that during [t, t + x], a specific route of length l s.t.
x > 6l is selected for this connection protocol. Then
the communication is at most 3l per accept and l^2 per
failure of a link in this route. If the route is up during
[t, t + x], then the throughput is O(1) during [t, t + x].
Both complexities are for intervals of length 6l.

Proof Sketch: The FIFO follows from [SJ86]. If the
route does not fail, the set-up costs are amortized and
negligible. If the route fails, the costs are attributed
to the failures. □

As  mentioned  above,  APPN  uses  topology  update 
according  to  Theorem  1,  selects  always  the  shortest 
route,  and  uses  the  connection  protocol  of  Theorem 
2. 

Theorem 3 Consider a connection-based communication
scheme where the shortest route is always selected
and with topology update and connection protocols
satisfying Theorems 1 and 2, respectively. Then
the communication complexity is O(l) per message
accepted and O(m) per failure, and the throughput is
O(1), for intervals (t, t + x] satisfying: (a) x > 24·l·m,
(b) during [t, t + x] the MTBF of the network is at
least 6l and (c) during [t − 2n, t + x] there is a 3l-Up
route of length l or less from source to destination.

Proof Sketch: Since the shortest route is always
used, then from Theorem 1 all routes used during
[t − n, t + x] are of length l or less. Since the MTBF
during [t, t + x] is at least 6l, it follows that there
would be at most x/(6l) failures during (t, t + x]. Hence
there would be at most x/(6l) + m recoveries. Since routes
are switched only upon a failure or a recovery, there
are at most 2·x/(6l) + m route switchings during (t, t + x].
Hence, from Theorem 2 and simple arithmetic, the
number of messages delivered is at least [illegible] − ml, from
which the throughput follows.

To compute the communication complexity, we
consider only the connection protocol since Theorem 1
already shows that the topology update contributes
at most O(m) per failure. Furthermore, we
are concerned only with the set-up and take-down
costs which are both at most O(l^2), since successful
transmissions require only l send events. As argued
above, there are at most 2·x/(6l) + m route switchings
during (t, t + x], and at least [illegible] − ml deliveries. The
claim follows by arithmetic. □

3.2  Improving  the  route  selection 

Theorem 3 shows that APPN ‘operates well’ if the
MTBF is 6l or more. If the MTBF is substantially
smaller, e.g. l, then APPN does not work even if there
is a route which never fails from source to destination
with length l. This is possible even if each link fails
only rarely. In this concise version we omit the details
of this (simple) construction. We now present
a simple modification to the route selection, which
ensures operation whenever the source and the destination
are connected by a route which is up for long
enough.

The kind of scenario we wish to avoid is selecting
alternately one of a pair of short but unstable routes,
while a longer route is operating all the time. On
the other hand, we should use a short route that has
recovered and is up for ‘long enough’.

A simple solution is to try the routes by order of
increasing length, without trying again a link that
failed. After trying all possible routes, or after trying
for ‘long enough’, we re-start the process from
the shortest route.

Suppose there is a route r which is operating all the
time, with length l. We try at most m routes before
r, since we never try a link that failed. If the route r
is operating, we use it to deliver ‘enough’ messages to
compensate for the time and communication spent on
trying other routes, in order to ensure good averaged
complexities. The time spent is at most ml, and we
try to get amortised throughput of 1; the communication
spent is at most ml^2, and we try to get amortised
communication of l. Hence, it is sufficient to ensure that
the total number of deliveries since last re-start with the
shortest route is at most ml.

To  summarise: 

1. The algorithm starts a ‘cycle’ by trying to use
the shortest route which is up according to the
topology data.

2.  Whenever  the  route  in  use  fails,  we  use  the  short¬ 
est  route  whose  links  did  not  fail  during  this  cy¬ 
cle. 

3.  We  start  a  new  cycle  in  one  of  the  following  cases: 

•  when  all  routes  from  source  to  destination 
contain  a  link  that  failed  in  the  current  cy¬ 
cle,  or 

• after delivering ml messages during a cycle,
where l is the maximal length of a route
used in this cycle.
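The three rules above can be sketched as a small state machine. This is a minimal illustration, not the authors' implementation: the class `CycleRouteSelector`, its method names, and the representation of a route as a tuple of link names are all assumptions, and the topology database is reduced to a set of links currently believed down.

```python
class CycleRouteSelector:
    """Cycle-based route selection (illustrative sketch)."""

    def __init__(self, routes, m):
        self.routes = sorted(routes, key=len)  # candidate routes, shortest first
        self.m = m                             # number of links in the network
        self.down = set()                      # links believed down (topology data)
        self.start_cycle()

    def start_cycle(self):
        # Rule 1: forget this cycle's failures and take the shortest route
        # that is up according to the topology data.
        self.failed = set()
        self.delivered = 0
        self.max_len = 0
        self.current = self._pick()

    def _pick(self):
        blocked = self.failed | self.down
        for route in self.routes:
            if blocked.isdisjoint(route):
                self.max_len = max(self.max_len, len(route))
                return route
        return None                            # no usable route right now

    def on_failure(self, link):
        self.down.add(link)
        self.failed.add(link)
        if self.current is None or link in self.current:
            # Rule 2: shortest route whose links did not fail in this cycle.
            self.current = self._pick()
            if self.current is None:
                # Rule 3, first case: every route contains a failed link.
                self.start_cycle()

    def on_recovery(self, link):
        self.down.discard(link)
        if self.current is None:
            self.start_cycle()

    def on_delivery(self):
        self.delivered += 1
        # Rule 3, second case: m*l deliveries, l = longest route used this cycle.
        if self.delivered >= self.m * self.max_len:
            self.start_cycle()
```

Note that a recovered link is not retried until the next cycle begins, which is what prevents alternating between two short but unstable routes.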

This scheme works under very similar conditions to
those in Theorem 3; we omit the proof. Furthermore,
it also works when there is a sufficiently-up route from
source to destination, without bounding the MTBF.
This is shown in the following theorem.

Theorem 4 Consider a communication scheme
where routes are selected as described above and with
topology update and connection protocols satisfying
Theorems 1 and 2, respectively. This scheme has the
properties of Theorem 3. Furthermore, the communication
complexity is O(l) per message accepted and
O(m) per failure, the throughput is O(1) and the average
delay is O(l). All the complexities are for intervals
(t, t + x] s.t. x > 6·l·m and during [t − 6nm, t + x]
there is a route of length l or less from source to
destination which is 6lm-Up.

Proof Sketch: A new cycle would begin at some time
t' ∈ [t − 6nm, t]. By induction, every route selected
during [t', t + x] will be at most of length l. The
complexities follow by arithmetic and Theorems 1
and 2. □

4 Connection Protocol

Our goal is to provide more robust implementations
of  connection  protocols  satisfying  conditions  similar 
to  Theorem  2.  In  the  first  subsection  we  present  a 
connection  protocol  which  does  not  use  unbounded 
resources  (for  counters).  We  then  show  that  this  pro¬ 
tocol  tolerates  crashes.  The  third  subsection  contains 
the  self-stabilized  extension. 

4.1  A  Connection  Protocol  with 
Bounded  Resources 

The use of unbounded resources in existing connection
protocols [Gro82] is only during the set-up
phase. In the set-up phase, each intermediate processor
stores the identity of the next processor along
the route, so that the data packets sent later do not
have to contain the description of the entire route.
We now show a new set-up phase, which does not use
unbounded counters. First, it is useful to understand
why unbounded counters are used in [Gro82].

The  unbounded  counters  are  used  in  [Gro82]  to  dis¬ 
tinguish  between  old  and  new  connections  between 
the same source and destination. Indeed, works which
consider  only  one  connection  do  not  use  unbounded 
counters  [SJ86,  Spi89,  Spi90].  There  are  two  motiva¬ 
tions  for  distinguishing  between  old  and  new  connec¬ 
tions: 

Storage:  Intermediate  processors  have  to  keep  track 
of  all  the  connections  flowing  through  them. 
If  the  storage  in  an  intermediate  processor  is 
bounded,  then  it  must  be  able  to  use  this  storage 
for  new  connections  rather  than  wasting  it  on  old 
connections. 

FIFO:  The  destination  should  preserve  FIFO,  and 
therefore  it  should  never  deliver  messages  from 
old  connections. 

The use of unbounded counters, given sequentially
to the connections, enables the identification of a new
connection by simply comparing the number of the
connection to the largest connection number known.
We now show alternative methods of addressing the
two motivations above, without unbounded counters.

The  simpler  solution  is  for  the  storage  (of  interme¬ 
diate  processors).  The  solution  is  to  allow  at  most  one 
connection  over  each  link,  for  each  pair  of  source  and 
destination.  If  a  processor  receives  a  set-up  packet  for 
a  link  which  is  already  allocated  to  a  connection  with 
the  same  source  and  destination,  then  it  delays  this 
packet  until  one  of  the  two  connections  is  taken-down. 
When  a  connection  is  taken  down,  by  the  source  or 
by  a  failure,  then  a  ‘take-down’  packet  propagates  all 
over  the  route.  This  packet  enables  the  intermediate 
processors  to  free  the  storage  allocated  to  this  con¬ 
nection. 
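A minimal sketch of this storage rule, for one link of an intermediate processor, might look as follows. The class `LinkTable` and the connection labels passed as `conn` are hypothetical handles used only for illustration; they are not the connection counters that this section sets out to eliminate.

```python
from collections import deque


class LinkTable:
    """Bookkeeping for one link: at most one connection per (source, dest)."""

    def __init__(self):
        self.active = {}    # (source, dest) -> connection holding the link
        self.waiting = {}   # (source, dest) -> deque of delayed set-up packets

    def on_setup(self, source, dest, conn):
        """Return True if the set-up is forwarded, False if it is delayed."""
        pair = (source, dest)
        if pair in self.active:
            # Link already allocated to this pair: delay the set-up packet.
            self.waiting.setdefault(pair, deque()).append(conn)
            return False
        self.active[pair] = conn
        return True

    def on_take_down(self, source, dest, conn):
        """Free storage for conn; return a delayed set-up that may now proceed."""
        pair = (source, dest)
        queue = self.waiting.get(pair)
        if queue and conn in queue:
            # The delayed connection itself was taken down.
            queue.remove(conn)
            return None
        if self.active.get(pair) == conn:
            del self.active[pair]
            if queue:
                nxt = queue.popleft()
                self.active[pair] = nxt
                return nxt
        return None
```

The storage needed per link is then bounded by the number of source-destination pairs rather than by the number of connections ever attempted.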

The crucial point is that at most O(n) time is required
to wait for the old connections to be taken
down. The time required is the time for detection
of failures (which is assumed to be one time unit)
plus the propagation time of the take-down packets
along the route (one time unit per link). In fact, if
during the last n time units all the routes used have
been of length l or less, then the time is only O(l).
This gives the efficiency required by Theorem 2 and
thereby suffices to implement Theorem 4.

We  now  consider  the  other  motivation,  namely  pre¬ 
serving  FIFO  order  among  the  messages  delivered. 
This  problem  concerns  only  the  destination.  In  this 
case,  the  destination  receives  the  set-up  packets  of 
a  connection.  Before  delivering  messages  from  this 
connection,  the  destination  should  verify  that  this 
connection  is  indeed  newer  than  the  last  connection 
from  which  the  destination  delivered  messages.  This 
is  since,  due  to  failures  and  uneven  delays,  an  older 
connection  may  be  received  at  the  destination  only 
after  a  newer  connection.  We  have  to  test  the  ‘fresh¬ 
ness’  of  a  connection  when  its  set-up  packet  arrives 
at  the  destination. 

The destination tests freshness of a route by sending
along it, to the source, another control packet
called route-OK. The source ignores route-OK from
old connections (we later show how to identify route-OK
packets which are from old connections that use
the existing route). When the source receives route-OK
from the existing connection, it starts sending
along the route the data packets. The data packets
are forwarded toward the destination, unless the
route is taken down (then they are discarded).*

We now explain how the destination can determine
if the data packets are new. Consider two routes for
which corresponding set-up, route-OK and data packets
were received. Theorem 5 below shows that the
order of the routes in the source is the same as the order
between the times when the corresponding set-up
packets were received at the destination. This is illustrated
in Fig. 1. Note that the theorem holds both
for the destination and for any intermediate processor
v. Hence, intermediate processors may also use this
indication of freshness, although it is not necessary
(see above).


[Fig. 1: figure not reproduced; the only surviving label is ‘(established)’.]


Theorem 5 Assume that processor v received corresponding
set-up, route-OK and data packets of two
routes r, r'. Then the order between the times when
the source sent the two set-up packets is the same as
the order between the times at which v received them.

The  protocol  above  requires  the  source  to  identify 
the  route-OK  packet  which  was  sent  in  response  to 
the  last  set-up.  We  call  a  route-OK  packet  stale  if 
it  was  not  sent  by  the  destination  upon  receiving  the 
last  set-up  packet  sent  by  the  source.  Our  goal  is  to 
allow  the  source  to  ignore  stale  route-OK  packets. 

* The protocol uses exactly the same flows as the connection
set-up protocol of [SJ86], which is used in SNA/APPN
[BGJ+85] by appending unbounded counters [Gro82]. We conjecture,
and hope to prove, that indeed the same actual flows
could serve both purposes.


By  simply  including  the  route  in  the  route-OK 
packet,  the  source  can  identify  a  stale  packet,  if  each 
connection  uses  a  different  route.  We  now  show  how 
to  enable  the  source  to  identify  a  stale  route-OK 
packet,  even  if  the  same  route  was  used  before.  The 
problem  is  that  when  the  source  issues  the  set-up 
packet, stale route-OK packets may be in transit on
the same route, and they should be ignored. We solve
this  by  ensuring  the  following: 

Invariant:  every  processor  ignores,  and  does  not 
forward,  any  stale  route-OK  packet  after  it  receives  a 
newer  set-up  packet  for  the  same  route. 

This  invariant  is  trivially  kept  by  the  destination. 
To  keep  the  invariant  by  all  processors,  we  introduce 
another  control  packet,  local-ack.  Every  processor 
sends  local-ack  immediately  after  receiving  a  set-up 
packet,  to  the  neighbor  which  sent  the  set-up  packet. 
Clearly,  a  route-OK  packet  received  before  the  local- 
ack  to  the  last  set-up  must  be  stale  and  is  ignored. 
If the link fails and the set-up or local-ack packets
are  thus  lost,  then  the  route  is  taken  down  from  both 
ends  and  any  subsequent  route-OK  will  be  irrelevant. 
By  induction  from  the  destination,  a  stale  route-OK 
cannot  be  received  after  the  local-ack  to  the  last  set¬ 
up.  Therefore,  the  local-ack  mechanism  implements 
the  invariant. 
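The local-ack rule can be sketched as follows for a single processor and a single route. The sketch assumes a FIFO link to the downstream neighbor, as in the text; the class `RouteEndpoint` and its method names are hypothetical. A counter of outstanding local-acks is used so that the rule still holds if several set-up packets are sent before the first local-ack returns.

```python
class RouteEndpoint:
    """State kept by one processor for one route (illustrative sketch)."""

    def __init__(self):
        self.pending_acks = 0   # set-up packets sent downstream, not yet local-acked
        self.forwarded = []     # route-OK packets passed on toward the source

    def send_setup(self):
        """A new set-up packet is sent to the downstream neighbor."""
        self.pending_acks += 1

    def on_local_ack(self):
        """The downstream neighbor acknowledged one set-up packet."""
        if self.pending_acks > 0:
            self.pending_acks -= 1

    def on_route_ok(self, packet):
        """Return True if the route-OK is forwarded, False if ignored as stale."""
        # On a FIFO link, a route-OK that arrives before the local-ack to the
        # last set-up was sent before the neighbor received that set-up.
        if self.pending_acks > 0:
            return False
        self.forwarded.append(packet)
        return True
```

A route-OK that was in transit when a new set-up was issued is thus dropped at the first processor it reaches, and never arrives at the source.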

4.2  Crash  Tolerance 

Processor  crashes  are  failures  where  the  processor 
loses  its  memory  and  re-starts  from  some  predefined 
initial  state  [BS88].  The  crash  of  the  processor  also 
causes the failure of all of its communication links.
We now show that the response of the protocol above
to these link failures is sufficient to handle the crash.
Our discussion assumes that the data-link is crash-tolerant,
namely that all of the properties of the link
hold  as  if  the  crash  was  a  link  failure.  This  may  be 
achieved  if  one  bit  per  link  is  non-volatile  [BS88]  or 
by  explicitly  allowing  probability  of  error  [GHM89]. 

An obvious response to a crash in any processor is
that all the connections flowing through that processor
are taken down and restarted. This holds also for
the source and destination processors of a connection.
Hence, crash-tolerance involves only the set-up and
the take-down phases of the protocol. Furthermore,
no special function is required for the take-down, as it
follows immediately from the link failure. It is easy to
verify that the connection protocol using the set-up
phase described above is also tolerant to crashes, with
only the obvious response above. Note that there is
no danger of confusion between connection set-up
following the recovery and ‘old’ connections. The reason
is that after the crash all the ‘old’ connections are
taken down.

4.3  A  Self-Stabilized  Connection  Pro¬ 
tocol 

For discussion and definitions of self-stabilised protocols,
see e.g. [KP90]. Our solution assumes that
the data link is self-stabilised, e.g. as in [AB89], and
delivers at most a constant number of packets before
stabilisation. We also assume that there is a periodic
time-out event in every processor. For simplicity,
when analysing the time complexity (only) we assume
that at most one time unit passes between any two
consecutive time-out events in any processor. This
periodic event is required for self-stabilised protocols
in the message-passing model, as shown in [KP90].
Intuitively, this is because a processor may be started
in a mode where it just sent a packet and expects a
reply, but the packet was never actually sent.

Our  solution  is  that  the  source  periodically  sends 
a  check  packet  on  the  route  to  the  destination  be¬ 
ing  set-up  or  used  now.  The  check  packet  contains 
the  description  of  the  route.  This  enables  the  pro¬ 
cessors  along  the  route  to  check  that  the  tables  are 
set  properly  (after  set-up  was  completed),  or  to  for¬ 
ward  the  packet  (during  set-up).  The  check  packet 
also  includes  the  previous  packet  sent  along  the  route, 
which  is  compared  to  the  previous  packet  received; 
they  should  be  the  same  since  the  link  is  FIFO.  If 
any  error  is  found,  the  route  is  taken  down.  In  par¬ 
ticular,  the  route  is  taken  down  if  it  is  not  valid,  e.g. 
if  it  has  already  been  taken  down.  If  no  error  occurs, 
the  check  packet  arrives  at  the  destination. 

The destination also periodically sends a check-back
packet to the source. This packet, again, contains the
route and the previous packet sent, for the same purposes.
Note that this packet is not really an acknowledgment
to the check packet, e.g. the source cannot
check if it received the expected response before the
check-back arrives. The reason is that multiple check
and check-back packets may be in transit at any moment,
and the source cannot identify the check-back
packet corresponding to the last check packet (except
randomly as in [GHM89]).

The  check  packet  is  sent  not  only  every  ‘time-out’ 
period,  but  also  every  n  data  packets.  This  limits  the 
number  of  data  packets  which  are  incorrectly  routed. 
Intermediate  processors  verify  that  a  check  packet  is 
received once every n packets, and if it is not received
then  the  connection  is  taken  down.  This  removes 
loops  in  the  routing  tables  due  to  a  transient  error. 
Note  that  the  amortised  complexities  are  preserved. 
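The check performed by an intermediate processor can be sketched as a simple counter. This is an illustration only: the class `CheckMonitor` is hypothetical, and it assumes the source sends n data packets and then a check packet, so that seeing more than n data packets in a row indicates an error.

```python
class CheckMonitor:
    """Per-connection watchdog at an intermediate processor (illustrative)."""

    def __init__(self, n):
        self.n = n
        self.since_check = 0   # data packets seen since the last check packet
        self.up = True         # False once the connection has been taken down

    def on_check(self):
        """A check packet arrived: restart the count."""
        self.since_check = 0

    def on_data(self):
        """Return True if the data packet is forwarded, False if discarded."""
        if not self.up:
            return False
        self.since_check += 1
        if self.since_check > self.n:
            # No check packet within n data packets: take the connection down.
            self.up = False
            return False
        return True
```

With this rule, a loop in the routing tables left by a transient error can carry at most n data packets past any processor before that processor takes the connection down.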

A transient error may also cause a processor to have
a ‘bogus’ connection, i.e. this connection is not defined
in the other processors along the route. This
may cause this processor to delay the newest set-up
packet forever, waiting for the newest route or the
bogus route to go down. To solve this problem, each
processor periodically compares its active and waiting
connections with its ‘upstream’ neighbors. Namely,
processor u sends to neighbor v the description of
its connections where u immediately follows v. This
enables v to take down any such bogus connection
in the tables of u. If the tables of v also contain
the bogus connection, it will be detected and taken
down by the first processor upstream along the route
with correct tables, or ultimately by the source. This
process requires O(n) communication for every link
of the processor with the ‘time-out’ event, per each
pair of source and destination. Note that with high
probability this can be reduced to O(1) for all source-destination
pairs together, by using random hashing.

Theorem 6 The connection protocol described in
this section satisfies Theorem 2. Furthermore, for
any execution with link failures, processor crashes and
arbitrary initial state of the network at time 0, the
following holds. The sequence of messages delivered
after O(n) is a prefix of the sequence of messages
sent after O(n). Let l be the (maximal) length of the
route used. The communication after O(n) is O(l)
per message, O(l^2) per failure, and O(n^2) per periodical
wakeup. Also, assume that during [t, t + x], for
t > 2n, a specific route of length at most l is selected
and is up, and that during [t − n, t] all routes selected
were of length l or less. Then the throughput during
[t, t + x] is O(1). All complexities are for intervals of
length O(l).

5  Conclusions  and  Open  Ques¬ 
tions 

We analyzed a simple protocol for end-to-end communication
in dynamic networks. This protocol is
extremely efficient under the following quantitative
condition: during an interval of length O(nm), the
source and the destination are connected by a route
of length l which is O(lm)-Up. The complexities
achieved are O(l), which compares favorably with, e.g.,
the simple lower bound of Ω(m) on the communication
complexity possible assuming only eventual
connectivity. Is it possible to require only a shorter
interval or allow a route which is up for less time?
Can we find tradeoffs with the MTBF?

The  result  above  requires  unbounded  counters. 
Can  we  achieve  the  above  with  bounded  resources? 
We have shown an implementation of a bounded
connection protocol. Hence, it is enough to find a
bounded implementation of topology update with the
properties of Theorem 1.

Abstract

This  paper  studies  the  problem  of  deadlock-free  packet 
routing  in  parallel  and  distributed  architectures.  We 
present  three  main  results.  First,  we  show  that  the 
standard  technique  of  ordering  the  queues  so  that  ev¬ 
ery  packet  always  has  the  possibility  of  moving  to  a 
higher  ordered  queue  is  not  necessary  for  deadlock- 
freedom.  Second,  we  show  that  every  deadlock-free, 
adaptive  packet  routing  algorithm  can  be  restricted,  by 
limiting  the  adaptivity  available,  to  obtain  an  oblivious 
algorithm  which  is  also  deadlock-free.  Third,  we  show 
that  any  packet  routing  algorithm  for  a  cycle  or  torus 
network  which  is  free  of  deadlock  and  which  uses  only 
minimal  length  paths  must  require  at  least  three  queues 
in  some  node.  This  matches  the  known  upper  bound  of 
three  queues  per  node  for  deadlock-free,  minimal  packet 
routing  on  cycle  and  torus  networks. 

1  Introduction 

This  paper  studies  the  problem  of  deadlock-free  packet 
routing  in  parallel  and  distributed  architectures.  A  wide 
range  of  packet  routing  algorithms  with  differing  prop¬ 
erties  and  costs  have  been  proposed  [1,  2,  3,  4,  5,  6,  8, 
9, 10, 12, 13, 14, 15, 16, 17, 19, 20]. In this paper we
will focus on a particularly simple and important class of
routing algorithms which we will call queue-reservation
algorithms.  A  queue-reservation  algorithm  consists  of 
rules  that  specify  to  which  queues  a  packet  may  move 
based  solely  on  the  queue  currently  holding  the  packet, 
the packet’s source node, and the packet’s destination
node.  A  packet  is  allowed  to  move  from  its  current 
queue  to  any  other  queue  at  any  time,  provided  that  the 
other  queue  is  empty  and  that  the  move  is  allowed  by  the 
routing  algorithm.  Queue-reservation  algorithms  can  be 
implemented  efficiently  in  hardware  as  they  require  only 
local information to make routing decisions, they are
inherently  asynchronous  and  therefore  do  not  require  a 
global  clock,  and  they  do  not  require  the  creation  or 
exchange  of  any  special  packets  containing  only  control 
information.  Furthermore,  adaptive  queue-reservation 
algorithms  allow  packets  to  avoid  congestion,  thus  per¬ 
mitting  high  throughput  in  the  network.  As  a  result 
of  these  advantages,  queue-reservation  algorithms  have 
been  widely  studied  and  implemented. 
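The movement rule of a queue-reservation algorithm can be sketched as follows. This is a minimal illustration, assuming that each queue holds a single packet; the class `QueueReservationNetwork` and the `allowed` table, which stands in for the routing rules, are hypothetical names.

```python
class QueueReservationNetwork:
    """Packets move between queues under a queue-reservation rule (sketch)."""

    def __init__(self, allowed):
        # allowed maps (current queue, source node, destination node)
        # to the set of queues the packet may move to.
        self.allowed = allowed
        self.contents = {}   # queue -> (source, dest) of the packet it holds

    def inject(self, queue, source, dest):
        """Place a new packet in an empty queue; return True on success."""
        if queue in self.contents:
            return False
        self.contents[queue] = (source, dest)
        return True

    def move(self, src_queue, dst_queue):
        """Move the packet in src_queue to dst_queue if the rules permit."""
        packet = self.contents.get(src_queue)
        if packet is None:
            return False
        if dst_queue in self.contents:
            return False     # the other queue must be empty
        if dst_queue not in self.allowed.get((src_queue, *packet), set()):
            return False     # the move must be allowed by the routing algorithm
        self.contents[dst_queue] = self.contents.pop(src_queue)
        return True
```

The decision in `move` depends only on the current queue, the packet's source and destination, and whether the target queue is empty, which is why no global clock or control packets are needed.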

The  primary  disadvantage  of  queue-reservation  tech¬ 
niques  is  that  they  require  that  each  node  contain  some 
minimum  number  of  queues.  Although  a  great  deal  of 
research  has  been  devoted  to  the  problem  of  minimiz¬ 
ing  the  storage  requirements  of  queue-reservation  algo¬ 
rithms  [2,  4,  5,  6,  8,  10,  12,  15,  17,  18,  19,  20],  very 
little  is  known  in  terms  of  lower  bounds  on  the  storage 
which  is  required  by  such  algorithms.  Our  goal  in  this 
paper  is  to  characterize  the  properties  which  these  algo¬ 
rithms  must  have  in  order  to  be  free  of  deadlock  and  to 
use  these  properties  to  prove  lower  bounds  on  storage 
requirements. 

One  well-known  technique  for  proving  freedom  from 
deadlock is to order the queues so that every packet
always has the possibility of moving to a higher ordered
queue [8]. Providing such an ordering of the
queues is the standard technique for proving freedom
from deadlock and has been used by many researchers
[2, 4, 5, 6, 8, 10, 12, 15, 17, 19, 20]. Therefore,
it seems plausible that the existence of such an ordering
of the queues is a necessary condition for freedom from
deadlock. In fact, in the special case of oblivious
queue-reservation algorithms, Toueg and Steiglitz have shown
that the existence of such an ordering of the queues is
necessary for deadlock-freedom [18]. However, in this
paper  we  will  present  an  adaptive  queue-reservation  al¬ 
gorithm  which  is  provably  free  of  deadlock  and  for  which 
no  ordering  of  the  queues  can  be  defined  such  that  every 
packet  always  has  the  possibility  of  moving  to  a  higher 
ordered  queue.  Thus,  in  the  case  of  adaptive  routing 
the  technique  of  ordering  the  queues  is  sufficient  but 
not  necessary  for  avoiding  deadlock. 

On  the  other  hand,  we  will  prove  that  every  deadlock- 
free,  adaptive  queue-reservation  algorithm  can  be  re¬ 
stricted,  by  limiting  the  adaptivity  available,  to  obtain 
an  oblivious  algorithm  which  is  also  deadlock-free.  As  a 
result,  we  will  be  able  to  use  lower  bounds  on  the  storage 
requirements  of  oblivious  routing  algorithms  to  obtain 
lower bounds on the storage requirements of adaptive
routing algorithms. In particular, we will show that
any  adaptive  queue-reservation  algorithm  for  a  cycle  or 
torus  network  which  is  free  of  deadlock  and  which  uses 
only  minimal  length  paths  must  require  at  least  three 
queues  in  some  node.  This  matches  the  known  upper 
bound of three queues per node for deadlock-free, minimal
routing on cycle [7] and torus networks [2].

The  remainder  of  this  paper  is  organized  as  follows.  Def¬ 
initions  and  a  formal  description  of  the  routing  model 
are  given  in  Section  2.  Section  3  presents  an  exam¬ 
ple  of  a  deadlock-free,  adaptive  routing  algorithm  in 
which  it  is  impossible  to  order  the  queues  so  that  ev¬ 
ery  packet  always  has  the  possibility  of  moving  to  a 
higher  ordered  queue.  The  fact  that  every  deadlock-free, 
adaptive  queue-reservation  algorithm  can  be  restricted 
to  obtain  a  deadlock-free,  oblivious  algorithm  is  proven 
in Section 4. Lower bounds on the storage requirements
for deadlock-free minimal queue-reservation algorithms
are given in Section 5.

2  Preliminaries 

We  will  view  a  routing  network  as  being  an  undirected 
graph  in  which  the  nodes  represent  processors  and  the 
edges  represent  communication  links.  Each  node  con¬ 
tains  a  set  of  queues,  one  of  which  will  be  called  an 
injection  queue,  another  one  of  which  will  be  called  a  de¬ 
livery  queue,  and  the  remainder  of  which  will  be  called 
standard  queues.  Packets  can  enter  the  routing  network 
only by being placed in an empty injection queue in their
source  node,  and  they  can  be  removed  from  the  network 
only  when  they  are  in  the  delivery  queue  of  their  des¬ 
tination  node.  We  will  assume  throughout  that  each 
queue  can  hold  exactly  one  packet  and  that  the  number 
of  queues  is  finite. 

Given  the  queue  in  which  a  packet  is  currently  stored, 
and  given  the  packet’s  source  and  destination  nodes,  a 
routing  algorithm  specifies  a  set  of  queues  to  which  the 
packet  may  be  moved.  More  formally,  the  color  of  a 
packet  is  the  pair  (s,  d)  where  s  is  the  packet’s  source 
node  and  d  is  the  packet’s  destination  node.  We  will  say 
that  a  queue  has  color  c  if  it  contains  a  packet  with  color 
c. A routing algorithm A is a function which associates
a set of queues, called a waiting set, with each possible
queue and color pair (q, c). The waiting set which A
associates with the pair (q, c) will be denoted A(q, c).
Given a routing algorithm A, a queue q is reachable by
a packet p with color c if and only if there exists some
path q_0, q_1, ..., q_k such that q_0 is the injection queue
in p’s source node, q_k = q, and for all i, 1 ≤ i ≤ k,
q_i ∈ A(q_{i-1}, c). It is required that the waiting set A(q, c)
be empty if and only if either q is a delivery queue or q
is not reachable by a packet with color c.
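
The reachability definition above can be illustrated with a minimal Python sketch. The dictionary representation of a routing algorithm, the function name `reachable_queues`, and the three-queue example network are all assumptions made for illustration; they do not appear in the paper.

```python
from collections import deque

def reachable_queues(waiting, injection_queue, color):
    """Return every queue reachable by a packet of `color` injected at
    `injection_queue`, following the waiting sets A(q, c).

    `waiting` maps (queue, color) pairs to sets of queues (hypothetical
    representation of a routing algorithm A).
    """
    seen = {injection_queue}
    frontier = deque([injection_queue])
    while frontier:
        q = frontier.popleft()
        # A(q, c) is treated as empty when the pair is not listed.
        for nxt in waiting.get((q, color), set()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Hypothetical example: injection queue "I", standard queue "X",
# delivery queue "D", and a single color "c".
waiting = {
    ("I", "c"): {"X"},
    ("X", "c"): {"D"},
    ("D", "c"): set(),  # delivery queues have empty waiting sets
}
print(sorted(reachable_queues(waiting, "I", "c")))
```

A breadth-first search is used here only because it visits each queue once; any graph traversal would compute the same set.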

All of the queues in a waiting set A(q, c) must either be
in the node which contains q or in neighboring nodes
(that is, nodes that are connected by an edge to the
node containing q). An injection queue is never allowed
to appear in a waiting set, and a delivery queue must
only be reachable by packets destined for the node containing
the delivery queue. Furthermore, if q2 ∈ A(q1, c)
and if either q1 is an injection queue or q2 is a delivery
queue, then q1 and q2 must be in the same node. Thus
injection and delivery queues are used only for placing
new packets in the network and for removing packets
once they have reached their destination.¹

The routing algorithm operates asynchronously. A
packet with color c may move from a queue q1 to any
empty queue q2 ∈ A(q1, c) at any time, and a new packet
with an arbitrary destination may be placed in an empty
injection queue at any time. Packets may be transmitted
in either store-and-forward [15] or virtual cut-through
[11] mode. The only requirement is that when
a packet is moved from one queue to another, it occupies
both of the queues for a finite amount of time, and
after a finite amount of time the former queue becomes
empty.

A  routing  algorithm  is  oblivious  if  every  waiting  set  con¬ 
tains  at  most  one  queue,  and  it  is  adaptive  otherwise. 
Routing algorithm A is a restriction of routing algorithm
B if and only if for every pair (q, c), A(q, c) ⊆
B(q, c), and for some pair (q, c), A(q, c) ≠ B(q, c). A
routing  algorithm  is  minimal  if  every  packet  is  routed 
from  its  source  node  to  its  destination  node  while  visit¬ 
ing  the  minimum  number  of  nodes  possible.  Note  that 
the  concept  of  minimality  is  based  on  the  number  of 
nodes  visited,  rather  than  the  number  of  queues  visited. 
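
As a minimal sketch of the two definitions above, the following Python checks whether a routing algorithm is oblivious and whether one algorithm is a restriction of another. The dictionary representation and the names `is_oblivious`, `is_restriction`, `algo_b` and `algo_a` are hypothetical and chosen only for illustration.

```python
def is_oblivious(waiting):
    """True if every waiting set contains at most one queue."""
    return all(len(ws) <= 1 for ws in waiting.values())

def is_restriction(a, b):
    """True if `a` is a restriction of `b`: A(q, c) is a subset of B(q, c)
    for every pair, and the two differ for at least one pair."""
    pairs = set(a) | set(b)
    subset = all(a.get(p, set()) <= b.get(p, set()) for p in pairs)
    differs = any(a.get(p, set()) != b.get(p, set()) for p in pairs)
    return subset and differs

# Hypothetical adaptive algorithm with one two-element waiting set,
# and the restriction obtained by removing "q2" from it.
algo_b = {("q1", "c1"): {"q2", "q3"}}
algo_a = {("q1", "c1"): {"q3"}}

print(is_oblivious(algo_b), is_oblivious(algo_a), is_restriction(algo_a, algo_b))
```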

A configuration is a nonempty set S of queues such that
each queue q in S is either empty or has color c where q is
reachable by a packet with color c. The set of queues S
will be called the critical set of the configuration. Given
a configuration T with critical set S and given any queue
q ∈ S, the notation T(q) will denote q’s color in configuration
T (or the value “empty” if it does not contain
a packet). A deadlock configuration for a routing algorithm
A is a configuration with a critical set S such that
none of the queues in S is a delivery queue, none of the
queues in S is empty, and for each queue q in S, q has
color c where A(q, c) ⊆ S. A configuration is routable if
and  only  if  it  is  possible  to  start  with  an  empty  network 
and  to  route  packets  so  as  to  obtain  the  configuration. 
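
The deadlock-configuration test above can be sketched in Python as follows. The representation (a configuration as a dictionary from queue to color, with `None` for an empty queue) and the two-queue example are assumptions for illustration. The sketch takes for granted that the input is a valid configuration, and it does not test routability, which depends on how packets can actually be moved into place.

```python
def is_deadlock_configuration(config, waiting, delivery_queues):
    """Check the three conditions of a deadlock configuration:
    no delivery queue in S, no empty queue in S, and A(q, c) within S."""
    critical = set(config)  # the critical set S
    if not critical:
        return False
    for q, color in config.items():
        if q in delivery_queues or color is None:
            return False
        if not waiting.get((q, color), set()) <= critical:
            return False
    return True

# Hypothetical example: two standard queues whose packets each wait
# only for the other queue.
waiting = {("a", "c1"): {"b"}, ("b", "c2"): {"a"}}
blocked = {"a": "c1", "b": "c2"}
partial = {"a": "c1"}  # "b" lies outside the critical set

print(is_deadlock_configuration(blocked, waiting, set()))
print(is_deadlock_configuration(partial, waiting, set()))
```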

¹ It should be noted that the injection and delivery queues are
introduced  only  to  simplify  the  description  of  the  model,  and  that 
they  need  not  be  physically  present  in  an  actual  routing  network. 


A routing algorithm is deadlock-free if and only if it has
no routable deadlock configuration. It is straightforward
to verify that this definition of deadlock-freedom
does in fact correspond to the impossibility of obtaining
deadlock when using the given routing algorithm.
Finally, given any two configurations T' and T'' with
critical sets S' and S'', respectively, T' ⊕ T'' will denote
the configuration T with critical set S = S' ∪ S''
in which for each queue q ∈ S', T(q) = T'(q), and for
each queue q ∈ S'' \ S', T(q) = T''(q). Thus T' ⊕ T'' is
obtained by taking configuration T' and adding to it all
of those queues in T'' which are not also in T'.
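
The operation that combines two configurations can be sketched in a few lines of Python. The dictionary representation (queue to color, `None` for empty) and the name `combine` are assumptions for illustration. Note that the operation is not symmetric: where both configurations contain a queue, the first one wins.

```python
def combine(t1, t2):
    """Return the combination of t1 and t2: every queue of t1 keeps its
    value, and queues that appear only in t2 are added."""
    result = dict(t2)
    result.update(t1)  # entries of t1 override those of t2
    return result

# Hypothetical configurations sharing queue "a".
t_first = {"a": "c1"}
t_second = {"a": "c2", "b": None}

print(combine(t_first, t_second))
print(combine(t_second, t_first))
```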

3  Deadlock-Freedom  Without 
Ordering  Queues 

In  this  section  we  will  show  that  the  standard  technique 
of  ordering  the  queues  so  that  every  packet  always  has 
the  possibility  of  moving  to  a  higher  ordered  queue  is  not 
necessary  for  the  prevention  of  deadlock.  In  particular, 
we  will  give  a  simple  example  of  an  adaptive  routing 
algorithm  which  is  provably  free  of  deadlock  and  yet 
has  no  such  ordering  of  the  queues. 

The example is routing algorithm A shown in Figure 1,
in which each circle represents a queue and each arc represents
a possible move between queues. There are three
injection queues labeled I1, I2 and I3, and three delivery
queues labeled D1, D2 and D3. In addition, there are
six standard queues labeled X1, X2, X3, Y1, Y2 and Y3.
We will consider only three colors of packets, namely C1,
C2 and C3, where packets with color C_i, 1 ≤ i ≤ 3, are
injected in queue I_i and delivered from queue D_i. The
label associated with each arc specifies which color packets
are allowed to make the given move between queues.
For example, A(I1, C1) = {X1}, A(X1, C1) = {X2, Y1},
and A(X1, C2) = {X3}. Of course a complete routing
algorithm would provide routes for packets with other
colors  and  would  include  an  assignment  of  the  queues  to 
the  nodes  in  a  routing  network.  However,  it  is  straight¬ 
forward  to  extend  the  given  example  by  adding  addi¬ 
tional  queues  for  the  packets  with  other  colors  and  to 
assign  the  queues  to  nodes  in  a  routing  network  without 
changing the deadlock or queue ordering properties of
the  example. 

Lemma  3.1  The  routing  algorithm  A  shown  in  Fig¬ 
ure  1  is  free  of  deadlock. 

Proof:  Assume  for  the  sake  of  contradiction  that  dead¬ 
lock  is  possible,  in  which  case  there  must  be  some 
routable deadlock configuration with a nonempty critical
set S. Clearly, the delivery queues cannot appear
in S. Similarly, Y1, Y2 and Y3 cannot appear in S because
they are only reachable by packets which are able
to move directly to a delivery queue. Also, note that if
injection queue I_i, 1 ≤ i ≤ 3, is in S, then queue X_i
must also be in S. Therefore, at least one of the queues
X_i must be in S. Because none of the queues Y_i is in S,
it follows that if X1 is in S it must have color C2 in the
deadlock configuration, if X2 is in S it must have color
C1 in the deadlock configuration, and if X3 is in S it
must have color C3 in the deadlock configuration. Note
that X3 must be in S, because otherwise either X1 or X2
must be in S, and X3 ∈ A(X1, C2) and X3 ∈ A(X2, C1).
Because X1 ∈ A(X3, C3) and X2 ∈ A(X3, C3), both X1
and X2 must be in S. Therefore, the deadlock configuration
must include a C2 packet in X1 and a C1 packet in
X2. However, it is impossible to simultaneously route a
C2 packet to X1 and a C1 packet to X2, so the deadlock
configuration is not routable, which is a contradiction.
□ 

Lemma  3.2  There  is  no  ordering  of  the  queues  shown 
in  Figure  1  such  that  every  packet  always  has  the  possi¬ 
bility  of  moving  to  a  higher  ordered  queue. 

Proof:  Assume  for  the  sake  of  contradiction  that  such 
an ordering is possible. Because A(X1, C2) = {X3} and
A(X2, C1) = {X3}, queue X3 must be higher ordered
than both X1 and X2. However, A(X3, C3) = {X1, X2},
so either X1 or X2 (or both) must be higher ordered
than X3, which is a contradiction. □
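
The contradiction in this proof can also be checked exhaustively. The sketch below uses only the three waiting sets quoted in the proof and tries every ordering of X1, X2 and X3. Restricting attention to these three queues is enough, since any ordering of all the queues induces an ordering of these three. The dictionary representation and the name `valid_orderings` are assumptions for illustration.

```python
from itertools import permutations

# The three waiting sets used in the proof of Lemma 3.2.
waiting = {
    ("X1", "C2"): {"X3"},
    ("X2", "C1"): {"X3"},
    ("X3", "C3"): {"X1", "X2"},
}
queues = ["X1", "X2", "X3"]

def valid_orderings(queues, waiting):
    """Return every ordering in which each listed packet can move to
    at least one higher ordered queue."""
    found = []
    for order in permutations(queues):
        rank = {q: i for i, q in enumerate(order)}
        if all(any(rank[nxt] > rank[q] for nxt in ws)
               for (q, _color), ws in waiting.items()):
            found.append(order)
    return found

print(valid_orderings(queues, waiting))  # prints []
```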

Combining  the  two  previous  lemmas  yields  the  following 
theorem. 


Theorem  3.3  There  exists  an  adaptive  routing  algo¬ 
rithm  which  is  free  of  deadlock,  and  for  which  there  is 
no  ordering  of  the  queues  such  that  every  packet  always 
has  the  possibility  of  moving  to  a  higher  ordered  queue. 

4  Restrictions  of  Adaptive 
Routing  Algorithms 

In  this  section  we  will  show  that  every  deadlock-free, 
adaptive  packet  routing  algorithm  can  be  restricted  to 
obtain  an  oblivious  algorithm  which  is  also  deadlock- 
free.  The  proof  will  depend  on  the  following  lemma. 

Lemma 4.1 Let A be any deadlock-free, adaptive routing
algorithm, let q1 be any queue, and let c1 be any
color such that |A(q1, c1)| ≥ 2. Let q2 be any queue
such that q2 ∈ A(q1, c1), and let B be the restriction of
A obtained by removing q2 from the waiting set associated
with (q1, c1). If B is subject to deadlock, then there
must exist some routable deadlock configuration for B
in which queue q1 contains a packet with color c1.

Proof: Because B is subject to deadlock, there must
exist some configuration T which is a deadlock configuration
for B and which is routable by B. Because B
is a restriction of A, it follows that configuration T is
also routable by A. However, A is deadlock-free, so T
must not be a deadlock configuration for A. Because A
and B differ only in the waiting set associated with
(q1, c1), queue q1 must have color c1 in configuration T. □

Theorem  4.2  Given  any  adaptive,  deadlock-free  rout¬ 
ing  algorithm  A,  there  exists  an  oblivious,  deadlock-free 
routing  algorithm  B  which  is  a  restriction  of  A. 

Proof  Sketch:  Assume  for  the  sake  of  contradiction 
that  the  claim  is  false.  Then  there  must  exist  some 
adaptive,  deadlock-free  routing  algorithm  A  such  that 
every  routing  algorithm  A'  which  is  a  restriction  of  A 
is subject to deadlock. Let A be such a deadlock-free
routing algorithm, let q1 be any queue, and let c1 be any
color such that |A(q1, c1)| ≥ 2. Let q2 and q3 be any distinct
queues such that q2 ∈ A(q1, c1) and q3 ∈ A(q1, c1),
let A' be the restriction of A obtained by removing q2
from the waiting set associated with (q1, c1), and let A''
be the restriction of A obtained by removing q3 from
the waiting set associated with (q1, c1). It follows from
Lemma 4.1 that there exists a configuration T' (similarly,
T'') which is a routable deadlock configuration
for A' (similarly, A'') and in which queue q1 contains
a packet with color c1. Let T = T' ⊕ T'' and let S be
the critical set of T. Let R be the configuration which
also has critical set S but in which all of the queues are
empty. Note that the following properties hold:

Property  1:  Configuration  T  is  a  deadlock  configura¬ 
tion  for  A. 

Property 2: Configuration R is a routable configuration
for A.

Property  3:  The  set  S  is  the  critical  set  of  both  con¬ 
figuration  T  and  configuration  R. 

Property 4: Every nonempty queue q in R has a color
c such that A(q, c) ⊆ S.

We  will  define  an  algorithm  for  transforming  R  and  T 
while  maintaining  Properties  1  through  4  listed  above. 
The  algorithm  will  repeatedly  add  packets  to  empty 
queues  in  R  until  none  of  the  queues  in  R  is  empty. 
At  this  point  R  will  be  a  routable  deadlock  configura¬ 
tion,  which  will  be  the  desired  contradiction.  The  algo¬ 
rithm  for  transforming  R  and  T  consists  of  repeatedly 
performing  the  following  subroutine  until  R  contains  no 
empty  queues. 

First, select an arbitrary queue q which is empty in R.
Let c = T(q). Because queue q is reachable by some
packet p with color c (from the definition of a configuration),
there must exist a simple path from p’s injection
queue to queue q. Define the configuration P in which
the critical set consists of all of those queues that appear
in this simple path, and in which all of the queues in the
critical set contain a packet with color c. Transform R
to become the configuration obtained by adding P and
R (that is, perform the assignment R ← P ⊕ R), transform
T to become the configuration obtained by adding
P and T (that is, perform the assignment T ← P ⊕ T),
and let S be the critical set of the transformed configurations
R and T. Note that at this point R is routable,
because it is possible to first route packets with the desired
colors to all of the nonempty queues in R which
are not in P and then fill the queues in P with packets
with color c. Also, note that at this point T may not
be a deadlock configuration, because it is possible that
some of the packets in P have waiting sets that include
queues which are not in S.

Next, select an arbitrary queue q' in the simple path
described above such that A(q', c) ⊄ S (if such a queue
exists). Let q'' be the successor of q' in the simple path
described above (note that q'' must exist if q' exists,
because A(q, c) ⊆ S so q' ≠ q). Let A' be the restriction
of A obtained by removing q'' from the waiting
set associated with (q', c). It follows from Lemma 4.1
that there exists a configuration D which is a routable
deadlock configuration for A' and in which queue q' contains
a packet with color c. Let D' be the configuration
with the same critical set as D but in which all of the
queues are empty. Transform R to become the configuration
obtained by adding R and D' (that is, perform
the assignment R ← R ⊕ D'), transform T to become
the configuration obtained by adding T and D (that
is, perform the assignment T ← T ⊕ D), and let S be
the critical set of the transformed configurations R and
T. Repeat this procedure of selecting a queue q' in the
simple path such that A(q', c) ⊄ S and transforming R,
T and S until no such queue q' exists. When no such
queue q' exists, return from the subroutine.

Note  that  upon  returning  from  the  subroutine  Proper¬ 
ties  1  through  4  above  must  hold.  Also,  note  that  any 
queue  in  R  which  was  nonempty  before  calling  the  sub¬ 
routine  will  again  be  nonempty  after  calling  the  subrou¬ 
tine.  Finally,  note  that  following  the  call  to  the  subrou¬ 
tine,  R  contains  at  least  one  additional  nonempty  queue. 
Because  the  number  of  queues  is  finite,  this  procedure 
must  terminate,  at  which  point  R  is  both  routable  by  A 
and  a  deadlock  configuration  for  A,  which  is  a  contra¬ 
diction.  □ 
5 Minimal Routing in Cycle and
Torus  Networks 

In  this  section  we  will  prove  lower  bounds  on  the  number 
of  queues  per  node  that  are  required  for  deadlock-free, 
minimal  routing  in  cycle  and  torus  networks.  Our  ap¬ 
proach  will  be  to  first  prove  a  lower  bound  on  the  queue 
requirements  of  deadlock-free,  minimal,  oblivious  rout¬ 
ing  algorithms  for  cycle  networks.  We  will  then  use  this 
lower  bound,  along  with  Theorem  4.2  and  the  fact  that 
a  torus  network  can  be  decomposed  into  disjoint  cycles, 
to  obtain  a  lower  bound  on  the  queue  requirements  of 
deadlock-free,  minimal  routing  algorithms  for  both  cy¬ 
cle  and  torus  networks. 

Lemma  5.1  Let  routing  algorithm  A  be  any  deadlock- 
free,  minimal,  oblivious  routing  algorithm  for  a  cycle 
network  with  n  nodes.  The  cycle  network  must  contain 
at least 3n − 12 standard queues.

Proof: Because A is deadlock-free and oblivious, it follows
that there exists an ordering of the queues such
that every packet visits the queues in ascending order
[18]. Let k = ⌊n/2⌋ − 1 (so either n = 2k + 2
or n = 2k + 3). We will say that a packet is routed
in the clockwise direction if it visits queues in nodes
of the form i, (i + 1) mod n, (i + 2) mod n, ..., j, and
in the counterclockwise direction otherwise. Note that
for each node i, 0 ≤ i < n, algorithm A routes packets
from node i to node (i + k) mod n in the clockwise
direction. Therefore, for each node i, 0 ≤ i < n,
there must exist an ascending sequence of standard
queues s_{i,i}, s_{i,(i+1) mod n}, ..., s_{i,(i+k) mod n} where each
queue of the form s_{i,j} is located in node j (for example,
let s_{i,j} be the highest ordered standard queue in
node j which is visited by a packet with source node
i and destination node (i + k) mod n). For each i,
0 ≤ i < n, let S_i = s_{i,i}, s_{i,(i+1) mod n}, ..., s_{i,(i+k) mod n}
denote the ascending sequence of standard queues beginning
in node i. Let h = n − 1 − k, note that
S_h = s_{h,h}, s_{h,h+1}, ..., s_{h,n−1}, and note that S_0 and
S_h are disjoint.


We  will  say  that  a  sequence  of  queues  is  a  clockwise- 
increasing  (similarly,  counterclockwise-increasing)  se¬ 
quence  if  when  the  queues  are  visited  in  ascending  order, 
the  nodes  containing  the  queues  are  visited  in  clock¬ 
wise  (counterclockwise)  order.  We  will  show  that  there 
must  exist  at  most  three  mutually  disjoint  clockwise- 
increasing  sequences  of  queues,  the  total  length  of  which 
is  at  least  n  +  k.  There  are  two  cases: 

Case 1: There exists a clockwise-increasing sequence
of standard queues X = x_0, x_1, ..., x_{n−1} such
that for each i, 0 ≤ i < n, x_i is located in node i.
In this case, we have two subcases:

Case 1a: There exists an a, 0 ≤ a ≤ n − 1,
such that S_a and X are disjoint. In this
case, the two disjoint clockwise-increasing sequences
are S_a and X, and their total length
is n + k + 1.

Case 1b: For each i, 0 ≤ i ≤ n − 1, S_i and X
intersect. In this case, let a be the largest
value of i, 0 ≤ i ≤ n − 1, such that there exists
a value i' ≥ i where s_{i,i'} = x_{i'}. Let a' be any
value such that a' ≥ a and s_{a,a'} = x_{a'}. Let
b = (a + 1) mod n and let b' be any value such
that s_{b,b'} = x_{b'}. Note that a' ≥ a > k ≥ b'.
Let Y be the sequence

s_{b,b}, ..., x_{b'}, ..., x_{a'},
s_{a,a'+1}, ..., s_{a,(a+k) mod n}.

The sequence Y is clockwise-increasing and
has length n + k.

Case  2:  There  does  not  exist  such  a  sequence  X.  In 
this  case,  we  have  two  subcases: 

Case 2a: There exists an a, 0 < a < h, such
that S_a and S_0 are disjoint and such that
S_a and S_h are disjoint. In this case, the
three disjoint clockwise-increasing sequences
are S_0, S_a, and S_h, and their total length is
3k + 3 ≥ n + k.

Case 2b: For each i, 0 < i < h, either S_i and
S_0 intersect or S_i and S_h intersect, but not
both. In this case, let a be the largest value
of i in the range 0 ≤ i < h such that S_i and
S_0 intersect. Let a' be any value such that
s_{a,a'} = s_{0,a'}. Let b = a + 1 and let b' be any
value such that s_{b,b'} = s_{h,b'} (note that such
a b' must exist because of the definition of a
and the fact that S_h does not intersect S_0).
Let Y be the sequence

s_{0,0}, ..., s_{0,a'−1}, s_{a,a'}, ..., s_{a,a+k}

and let Z be the sequence

s_{b,b}, ..., s_{b,b'}, s_{h,b'+1}, ..., s_{h,n−1}.

Note that Y and Z are clockwise-increasing
sequences and that they must be disjoint
(because otherwise there would exist a
clockwise-increasing sequence X spanning all
of the nodes). Also, note that the length of
Y is a + k + 1 and the length of Z is n − a − 1,
so their total length is n + k.

Thus,  in  any  case  there  must  exist  at  most  three  mu¬ 
tually  disjoint  clockwise-increasing  sequences  of  queues, 
the  total  length  of  which  is  at  least  n  +  k.  An  anal¬ 
ogous  argument  can  be  used  to  show  that  there  must 
exist  at  most  three  mutually  disjoint  counterclockwise- 
increasing  sequences  of  queues,  the  total  length  of  which 
is  at  least  n+k.  Because  a  clockwise-increasing  sequence 
of  queues  and  a  counterclockwise-increasing  sequence  of 
queues  can  intersect  in  at  most  one  queue,  it  follows  that 
the  entire  collection  of  clockwise-increasing  sequences 
and  counterclockwise-increasing  sequences  contains  at 
least (n + k) + (n + k) − (3 · 3) = 2n + 2k − 9 ≥ 3n − 12
distinct  queues.  □ 

Theorem 5.2 Let routing algorithm A be any deadlock-free, minimal
routing  algorithm  for  a  cycle  network  with  13  or  more 
nodes  or  for  a  torus  network  in  which  at  least  one  of 
the  dimensions  is  of  length  13  or  greater.  The  cycle  or 
torus  network  must  contain  at  least  one  node  which  has 
three  or  more  standard  queues. 

Proof: The claim for a cycle network follows immediately
from Theorem 4.2 and Lemma 5.1 (with n ≥ 13 nodes,
3n − 12 > 2n, so some node must contain at least three
standard queues). The claim for
a torus network follows from Theorem 4.2, Lemma 5.1,
and the observation that a torus can be decomposed
into disjoint cycles such that all minimal length paths
between pairs of nodes within a cycle lie within the cycle.
□
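
The arithmetic behind Lemma 5.1 and Theorem 5.2 can be checked numerically. The sketch below verifies, for a range of cycle sizes, that 2n + 2k − 9 ≥ 3n − 12 when k = ⌊n/2⌋ − 1, and that the bound 3n − 12 exceeds 2n (which forces some node to hold at least three standard queues) exactly when n ≥ 13. The function names and the tested range are assumptions chosen for illustration.

```python
def k_of(n):
    """k = floor(n / 2) - 1, as defined in the proof of Lemma 5.1."""
    return n // 2 - 1

def distinct_queue_count(n):
    """Two families of sequences of total length n + k each, minus at
    most 3 * 3 shared queues."""
    k = k_of(n)
    return (n + k) + (n + k) - 3 * 3

def forces_three_queue_node(n):
    """More than 2n standard queues spread over n nodes means that
    some node holds at least three of them."""
    return 3 * n - 12 > 2 * n

for n in range(4, 200):
    assert distinct_queue_count(n) >= 3 * n - 12
    assert forces_three_queue_node(n) == (n >= 13)

print(forces_three_queue_node(12), forces_three_queue_node(13))
```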
Figure 1: A deadlock-free, adaptive routing algorithm A for which the technique
of  ordering  the  queues  cannot  be  used  to  prove  freedom  from  deadlock. 
The Slide Mechanism with Applications in Dynamic
Networks

(Extended  Abstract) 

Yehuda  Afek*  Eli  Gafni^  Adi  Rosen* 


Abstract 

This  paper  presents  a  simple  and  efficient  building 
block,  called  slide,  for  constructing  communication  pro¬ 
tocols  in  dynamic  networks  whose  topology  frequently 
changes.  We  employ  slide  to  derive  (1)  an  end-to-end 
communication protocol with optimal amortized message
complexity, and (2) a general method to efficiently
and  systematically  combine  dynamic  and  static  algo¬ 
rithms.  (Dynamic  algorithms  are  designed  for  dynamic 
networks,  and  static  algorithms  work  in  networks  with 
stable  topology.) 

The  new  end-to-end  communication  protocol  has 
amortized message communication complexity O(n)
(assuming  that  the  sender  is  allowed  to  gather  enough 
data  items  before  transmitting  them  to  the  receiver), 
where  n  is  the  total  number  of  nodes  in  the  network 
(the previous best bound was O(m), where m is the total
number of links in the network). This protocol also
has bit communication complexity O(nD), where D is
the data item size in bits (assuming data items are large
enough; i.e., for D = O(nm log n)). In addition we give,
as a byproduct, an end-to-end communication protocol
using O(n²m) messages per data item, which is considerably
simpler than other protocols known to us (the
best known end-to-end protocol has message complexity
O(nm) [AG91]). The protocols above combine in an
interesting  way  several  ideas:  the  information  disper¬ 
sal  algorithm  of  Rabin  [Rab89],  the  majority  insight  of 
[AFWZ88,  AAF+],  and  the  slide  protocol. 

The second application of slide develops a systematic
mechanism to combine a dynamic algorithm with
a  static  algorithm  for  the  same  problem,  such  that  the 
combined  algorithm  automatically  adjusts  its  commu¬ 
nication  complexity  to  the  network  conditions.  That 
is, the combined algorithm solves the problem in a dynamic
network, and if the network stabilizes for a long
enough  period  of  time  then  the  algorithm’s  communi¬ 
cation  complexity  matches  that  of  the  static  algorithm. 
This approach was first introduced in [AM88] in
the  context  of  topology  update  algorithms. 

1  Introduction 

One  of  the  most  important  tasks  of  distributed  al¬ 
gorithm  designers  is  to  construct  simple  and  efficient 
building  blocks  for  various  network  models.  While 
simple  and  efficient  building  blocks  for  synchronous 
and  asynchronous  static  networks  have  been  presented 
[Awe85,  AAG87,  CL85,  BGP89,  AP90,  DS80],  only 
complicated,  though  theoretically  efficient,  algorithms 
are known for dynamic networks whose topology frequently
changes (like the ARPANET ([MRR80]) and
DECNET ([Wec80])).

In this paper we present a simple and efficient building
block, called slide, for dynamic networks with frequently
changing topologies (i.e., for the ∞-delay model
[AG88]). Essentially, slide establishes a non-fifo virtual
communication link between two arbitrary nodes in the
network. Its effectiveness is demonstrated in three applications:

• An end-to-end protocol that is considerably simpler
than any previously known protocol and whose message
communication complexity is O(n²m) (compared
to the best known O(nm) messages),

• An O(n) amortized message complexity end-to-end
protocol, assuming that the sender is allowed to
gather enough data items before transmitting them
to the receiver, and

•  A  mechanism  that  senses  topological  changes  and 
controls  the  amortized  (per  topological  change) 
communication complexity of any dynamic algorithm.
If a static algorithm is run in parallel with
a dynamic algorithm controlled by the mechanism,
the  resulting  algorithm  adjusts  its  communication 
complexity  to  the  network  conditions. 

In  addition  to  these  applications,  the  slide  proto¬ 
col  has  already  proved  its  versatility  and  applicability 
in  [APSV91]  where  it  was  shown  that  both  it  and  its 
first  two  applications  above  (with  minor  changes),  are 
self-stabilizing  (as  a  result  the  third  application  also  pre¬ 
serves  the  self-stabilization  property). 

The  end-to-end  protocols  presented  in  this  paper  are 
data  non-oblivious.  The  transport  protocol  uses  the 
data. The first application, the majority algorithm,
employs the majority idea introduced in [AFWZ88,
AAF+], therefore using the contents of the delivered
data-items  to  make  its  decisions.  Our  second  appli¬ 
cation,  the  data  dispersal  algorithm,  breaks  each  large 
data  item  into  many  packets  (using  the  information 
dispersal  algorithm  of  Rabin  [Rab89])  and  achieves  an 
O(n) communication complexity (assuming large data
items, O(nm log n) bits). This is in contrast to other
known end-to-end communication algorithms, which are
all data-oblivious, and require that each data item as a
whole traverses each link in the network, resulting in
O(m) communication complexity. Our solution does
not yield O(n) communication complexity algorithms
for smaller data items, since it breaks each data item
into O(nm) pieces.

These algorithms, designed to tolerate faults, have high
communication complexity. They incur this complexity
even when the network is stable (the obvious end-to-end
communication algorithm for stable networks has
O(n) communication complexity). To adapt the complexity
to the case at hand, the second part of this paper
uses  slide  to  develop  a  systematic  method  to  alleviate 
this  drawback  by  combining  a  dynamic  algorithm  with  a 
static  one  to  get  a  new  dynamic  algorithm  whose  com¬ 
munication  complexity  matches  that  of  the  static  al¬ 
gorithm  if  the  network  becomes  static,  and  still  works 
correctly  even  if  the  network  topology  never  stabilizes. 
The  method  applies  to  on-line  algorithms  in  the  sense 
that  these  algorithms  receive  input  values  during  their 
run,  they  never  terminate  and  they  produce  a  series 
of  output  values  (like  end-to-end  communication  algo¬ 
rithms).  The  complexity  of  such  algorithms  in  dynamic 
networks  is  usually  measured  as  the  maximum  cost  they 
incur  between  any  two  successive  output  events.  In  or¬ 
der  to  capture  in  one  complexity  measure  the  complex¬ 
ity  of  the  combined  algorithm  both  when  the  network  is 
dynamic  and  when  topological  changes  cease,  we  give  a 
two  component  amortized  cost:  The  cost  per  topologi¬ 
cal  change  plus  the  cost  per  output  (input)  event.  The 
former  captures  the  cost  of  the  algorithm  when  there 
are  topological  changes  in  the  network,  while  the  latter 


captures  the  cost  in  times  when  the  network  topology 
is stable. Given a static algorithm with communication
complexity C_s, and a dynamic algorithm, we construct
a combined algorithm with amortized communication
complexity of O(n + C_s + m) per topological change
plus O(C_s) per output event.

The  implementation  of  this  method  depends  on  the 
possibility  to  merge  the  outputs  of  two  algorithms  that 
in  parallel  solve  the  same  problem,  into  one  output 
stream.  The  merging  itself  is  not  systematic  and  de¬ 
pends  on  the  specific  problem  being  solved.  Such  a 
merging  mechanism  for  the  end-to-end  problem  was 
given  in  [AG88]  and  is  used  here  to  exemplify  the 
method,  resulting  in  an  algorithm  that  runs  in  dynamic 
networks, and achieves O(n) communication complexity
if  the  network  stabilizes. 

1.1  Related  work 

Techniques  The  algorithms  in  this  paper  combine 
several  techniques  from  recently  reported  research. 
The  slide  mechanism  combines  in  an  interesting  way 
ideas  from  [AMS89]  with  a  technique  appearing  in 
[MS80].  Combining  slide  with  a  concept  introduced 
in [AFWZ88, AAF+], we construct an end-to-end algorithm,
the majority algorithm. This algorithm, in conjunction
with the information dispersal algorithm (IDA)
of  Rabin  [Rab89],  yields  our  data  dispersal  algorithm. 

In the second part of the paper we combine the slide
protocol with an idea introduced in [AM88], to generate
a simple and systematic method for combining a
dynamic version of an algorithm with a static version.
The method is exemplified on the end-to-end problem
by combining the bootstrap mechanism for dynamic
networks of [AG91] with a static broadcast and echo
along a path, to create a dynamic algorithm that automatically
adjusts its communication complexity to the
network conditions.

The  end-to-end  problem  One  approach  to  solve  the 
end-to-end  problem  in  a  dynamic  network  is  to  deliver 
the  data  items  over  a  fixed  path  and  to  construct  a  new 
path  every  time  a  link  on  the  path  fails.  Implementa¬ 
tions  of  this  approach  appear  in  [Fin79,  AAG87,  AS88, 
AAM89].  This  approach  requires,  however,  the  whole 
network  to  stabilize  for  a  period  of  time  allowing  the 
construction  of  the  path  and  the  communication  over  it 
[AAG87,  AS88,  AAM89],  or  at  least  requires  links  form¬ 
ing  some  path  between  the  sender  and  the  receiver  to 
be  operational  for  such  a  period  of  time  [AGH90]. 

In  [AE86]  it  is  stated  that  the  eventual  connectivity 
fairness  condition,  namely  that  there  is  no  edge-cut  all  of 
whose  links  are  permanently  down,  is  sufficient  for  com¬ 
munication  between  the  sender  and  the  receiver.  The 
problem  is  solved  in  [AE86,  Vis83]  under  this  condition 
by using sequence numbers to number the data items
sent  by  the  sender.  This  approach  yields  theoretically 
unbounded  algorithms,  as  the  sequence  numbers  grow 
with  the  number  of  data  items  transmitted. 

In  recent  years  a  sequence  of  works  gave  bounded, 
and  increasingly  efficient,  solutions  to  the  problem.  The 
first bounded end-to-end communication algorithm under
the eventual connectivity fairness condition was presented
in [AG88]; however, this solution has exponential
message complexity per data item. Subsequently, in
[AMS89], the first polynomial message complexity algorithm
was presented, having bit communication complexity
of O(n^9 + mD). Recently, a new resynchronization
protocol for dynamic networks was developed in
[AG91] and was used to construct an end-to-end communication
algorithm whose bit communication complexity
is O(nm log n + mD).

Outline: We start by introducing the model in Section 2
and give a description of the slide and the end-to-end
communication protocols in Section 3. An outline
of the proof of correctness and the complexity analysis is given
in the appendix. Section 4 contains the method to turn
any dynamic algorithm into a dynamic algorithm that
adjusts its communication complexity to network conditions.

2  The  Model 

We  consider  a  communication  network  in  the  form  of  a 
graph G(V, E), |V| = n, |E| = m, where the nodes are
processors  and  the  edges  are  undirected  communication 
links.  Each  undirected  link  consists  of  two  directed  links, 
delivering  messages  in  opposing  directions. 

Each link has bounded capacity. By capacity we refer
to the maximal number of messages in transit on a given
link at a given time. The bounded capacity can be either
constant, or a function of the network size. For clarity
of description, we consider networks in which each link
has O(n) capacity. As noted in the analysis, our algorithms
can be easily modified to conform to the constant
capacity model without affecting their complexities.

Communication over the links obeys the FIFO discipline,
and no bound on the transmission delay time is
known. A directed link is non-viable if starting from
some message it does not deliver any message. The
transmission delay time for this message and the subsequent
messages sent on this link is considered infinite.
A directed link is viable if any message sent over it is
eventually delivered. Similarly, an undirected link is viable
if both of its directed links are viable. (We assume
that a link that fails, fails in both directions.) Two nodes
in the network are eventually connected if there is a path
of viable links connecting them.


The model defined here is called the "∞-delay model"
in [AG88], and as stated there the model of failing and
recovering links [AAG87] can easily be reduced to it.
When combining dynamic algorithms with static algorithms
(Section 4) we consider the model of [AAG87]
in which topological changes are part of the input to
the algorithm. The reduction of the changing topology
model of [AAG87] to the ∞-delay model adds at most
O(1) messages per topological change to the protocol's
communication complexity in the ∞-delay model.

2.1  The  End  to  End  Communication 
Problem 

An  end-to-end  communication  algorithm  transmits  a  se¬ 
quence  of  data  items  from  a  sender  to  a  receiver.  The 
data  items  are  input  to  the  sender  on-line,  that  is  the 
sequence  is  not  known  at  the  beginning  of  the  opera¬ 
tion  of  the  algorithm  and  the  data  items  are  input  one 
by  one. 

The  algorithm  must  satisfy  the  following  properties: 

•  Safety:  At  any  time  the  sequence  of  the  data  items 
output  by  the  receiver  is  a  prefix  of  the  sequence  of 
data  items  input  by  the  sender. 

•  Liveness:  If  the  sender  and  the  receiver  are  even¬ 
tually  connected,  then  every  data  item  input  by  the 
sender  will  eventually  be  output  by  the  receiver. 
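
The safety property can be stated as a one-line check. The following is a minimal sketch; the function name is_prefix and the use of Python lists for the two sequences are assumptions made for illustration, not part of the paper.

```python
def is_prefix(output, inputs):
    """Safety: the receiver's output sequence is a prefix of the sender's input sequence."""
    return len(output) <= len(inputs) and all(o == i for o, i in zip(output, inputs))
```

Liveness, in contrast, is a property of infinite executions and cannot be checked on a finite trace in this way.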

The  Complexity  Measures  Since  the  algorithm 
transmits  a  sequence  of  data  items  input  one  by  one,  the 
complexity  analysis  should  evaluate  the  performance  of 
the  algorithm  per  data  item.  This  evaluation  is  done  by 
measuring  the  cost  of  the  algorithm  between  any  two 
successive  data  item  output  events  at  the  receiver. 

We  consider  the  following  complexity  measures:  Mes¬ 
sage  Complexity:  The  number  of  messages  transmitted 
in  the  network  in  the  worst  case  between  any  two  suc¬ 
cessive  output  events  at  the  receiver;  Bit  Communica¬ 
tion  Complexity:  The  number  of  bits  transmitted  in  the 
network  in  the  worst  case  between  any  two  successive 
output  events  at  the  receiver;  Space  Complexity:  The 
maximal  amount  of  space  required  at  each  node,  per 
incident  link,  measured  in  bits  of  memory. 

3  Description  of  the  Algorithms 

When presenting the code of the algorithms we use the
guarded commands language of Dijkstra [DF88], where
the code of each process is in the form G_1 → A_1 □ G_2 →
A_2 □ ... G_l → A_l □. The code is executed by repeatedly
selecting an arbitrary i from all guards G_i which are
true and executing A_i. A guard G_i is a conjunction of
predicates. The predicate Receive M is true when a
message M is available to be received. If the statements
associated with this predicate are executed, then prior
to this execution the message M is received. The message
may contain some values that are assigned, upon
its receipt, to variables stated in the Receive predicate
(e.g. Receive TOKEN(data)).

The following sections describe the various algorithms.
An outline of the correctness proofs and analysis
is given in the appendix.

3.1 The slide Protocol

In  this  protocol  one  node,  the  sender,  inputs  tokens  into 
the  network,  and  one  node,  the  receiver,  outputs  the  to¬ 
kens.  The  protocol  guarantees  that  tokens  are  neither 
lost  nor  duplicated,  and  that  the  total  number  of  to¬ 
kens  in  the  network  at  any  given  time  is  bounded.  If 
the  sender  and  the  receiver  are  eventually  connected, 
then  eventually  the  input  of  a  new  token  to  the  network 
is enabled at the sender. However, tokens are not necessarily
output at the receiver in the order in which they are
input at the sender. Thus the slide establishes between
the sender and the receiver a non-FIFO, bounded capacity
virtual communication link that neither loses nor duplicates
messages.

The  slide  protocol  is  implemented  by  storing  and 
transferring  tokens  between  the  nodes  of  the  network  as 
follows: Each undirected link is considered as two antiparallel
links. Each node maintains for each incident
incoming link a pile of slots numbered 1 to n. Each slot
has room for one token, and each pile is used to store
tokens arriving on the link associated with it. Tokens
from a pile can be forwarded over any outgoing link.
The crux of the protocol is that a token is sent from
any slot i at node v to slot j at the (v, u) pile at node u,
only if j < i. To this end the nodes maintain for each
outgoing  link  a  variable,  called  bound,  holding  an  upper 
bound  to  the  lowest  numbered  slot  available  at  the  other 
side  of  the  link.  The  tokens  are  sent  from  slots  with  a 
number  higher  than  the  bound,  and  thus  are  guaran¬ 
teed  to  conform  to  the  above  rule.  Every  time  a  token 
is  removed  from  a  pile,  a  signal  to  this  effect  is  sent  over 
the  incoming  link  associated  with  the  pile.  Since  the 
only  source  of  tokens  for  a  specific  pile  is  the  node  on 
the  other  end  of  its  associated  link,  the  bound  can  be 
maintained  by  incrementing  it  every  time  a  token  is  sent 
over  the  link,  and  decrementing  it  every  time  a  signal  is 
received  over  the  link.  Thus  the  bound  is  never  smaller 
than  the  number  of  tokens  in  the  pile  on  the  other  side 
of  the  link  plus  the  number  of  tokens  in  transit  over  the 
link. 

New tokens enter the network only at the sender, to
a special slot at level n, and only when this slot is vacant.
The receiver always has a vacant slot of level 1,
and removes and outputs any token it receives. If the
sender and the receiver are eventually connected, then


Initialization →
    for every incident link e
        bound[e] := 1;
        top[e] := 0;
□
Receive SIGNAL on e →
    bound[e] := bound[e] - 1;
□
Receive TOKEN(data) on e →
    top[e] := top[e] + 1;
    slots[e][top[e]] := data;
□
∃ e, e' s.t. top[e'] > bound[e] →
    /* e' not necessarily ≠ e */
    send TOKEN(slots[e'][top[e']]) on e;
    send SIGNAL on e';
    top[e'] := top[e'] - 1;
    bound[e] := bound[e] + 1;
□

a: ordinary node's code


Initialization →
    input_pile[n] := vacant;
□
input_pile[n] = vacant →
    input_pile[n] := next input;
□
input_pile[n] ≠ vacant
and ∃ e s.t. bound[e] < n →
    send TOKEN(input_pile[n]) on e;
    input_pile[n] := vacant;
    bound[e] := bound[e] + 1;

b: additions for the sender


Receive TOKEN(data) on e →
    output(data);
    send SIGNAL on e;

c: receiver's code


Figure 1: The slide
eventually the special slot at the sender is vacant. Thus
the tokens slide in the network from the sender to the
receiver by sliding from higher numbered slots to lower
numbered slots as they travel over links. Clearly, each
token can make at most n hops in the network. Since
the protocol maintains for each link 2n slots, and (as we
prove in the sequel) at most 2n tokens can be in transit
on each link at any given time, the total number of
tokens in the network is at most O(nm). This is the
capacity of the slide, denoted Cap.

The code of the algorithm is given in Figure 1. It
uses two types of messages: TOKEN messages, which
are used to transfer the tokens themselves, and SIGNAL
messages, which inform the node at the other end of a link
that a token was removed from the pile associated with
that link. The differences between the sender and
an ordinary node are due to the fact that the sender is
the node that inputs new tokens to the network. The
code of the sender is the code of an ordinary node and in
addition the code appearing in Figure 1b. Note that the
sender cannot input new tokens if input_pile[n]
is not vacant. The receiver does not send tokens and
outputs any token it receives. Therefore, its code is restricted
to the code appearing in Figure 1c.
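
The guarded commands of Figure 1 can be exercised with a small simulation. The sketch below is an illustration under stated assumptions, not the authors' implementation: the network is static, every link is viable, enabled guards are chosen uniformly at random, and the names run_slide, chan, pile and leftover are hypothetical. It returns the tokens output by the receiver, the tokens still held in the network or not yet input, and the largest pile height observed.

```python
import random
from collections import deque

def run_slide(n, edges, sender, receiver, items, seed=0, max_steps=100_000):
    """Simulate the slide of Figure 1 on a static graph with nodes 0..n-1."""
    rng = random.Random(seed)
    nbrs = {v: [] for v in range(n)}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    links = [(v, u) for v in nbrs for u in nbrs[v]]
    chan = {l: deque() for l in links}   # chan[(u, v)]: FIFO messages in transit u -> v
    pile = {l: [] for l in links}        # pile[(v, u)]: tokens at v that arrived from u
    bound = {l: 1 for l in links}        # bound[(v, u)]: v's bound for outgoing link v -> u
    pending, input_pile = deque(items), None
    delivered, max_pile = [], 0

    for _ in range(max_steps):
        # Collect every enabled guard, then fire one chosen at random.
        acts = [("recv", u, v) for (u, v), q in chan.items() if q]
        for v in nbrs:
            if v == receiver:
                continue                 # the receiver never forwards tokens
            acts += [("fwd", v, u, w) for u in nbrs[v] for w in nbrs[v]
                     if len(pile[(v, w)]) > bound[(v, u)]]
        if input_pile is None and pending:
            acts.append(("input",))
        if input_pile is not None:
            acts += [("inject", u) for u in nbrs[sender] if bound[(sender, u)] < n]
        if not acts:
            break                        # quiescent: no guard is enabled
        act = rng.choice(acts)
        if act[0] == "recv":
            _, u, v = act
            msg = chan[(u, v)].popleft()
            if msg[0] == "SIGNAL":
                bound[(v, u)] -= 1
            elif v == receiver:          # receiver: output and acknowledge
                delivered.append(msg[1])
                chan[(v, u)].append(("SIGNAL",))
            else:                        # ordinary node: push onto the pile
                pile[(v, u)].append(msg[1])
                max_pile = max(max_pile, len(pile[(v, u)]))
        elif act[0] == "fwd":            # top[e'] > bound[e]: forward the top token
            _, v, u, w = act
            chan[(v, u)].append(("TOKEN", pile[(v, w)].pop()))
            chan[(v, w)].append(("SIGNAL",))
            bound[(v, u)] += 1
        elif act[0] == "input":          # sender fills its special slot
            input_pile = pending.popleft()
        else:                            # sender moves the special slot onto a link
            u = act[1]
            chan[(sender, u)].append(("TOKEN", input_pile))
            input_pile = None
            bound[(sender, u)] += 1

    leftover = list(pending) + ([] if input_pile is None else [input_pile])
    leftover += [t for p in pile.values() for t in p]
    leftover += [m[1] for q in chan.values() for m in q if m[0] == "TOKEN"]
    return delivered, leftover, max_pile
```

For example, run_slide(4, [(0, 1), (1, 2), (2, 3)], 0, 3, list(range(50))) runs the protocol on a four-node path. Some tokens remain stored in the piles when the input stops, which is the bounded delay that the algorithms of the following subsections must tolerate.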

In  the  following  subsections  we  present  two  solutions 
to  the  problem  of  end-to-end  communication  in  dynamic 
networks,  that  use  the  slide  as  a  building  block.  The  sec¬ 
ond  solution,  the  data  dispersal  algorithm,  is  our  main 
result  that  achieves  linear  communication  cost  for  large 
enough  data  items. 

3.2  The  Majority  Algorithm 

Given the slide we construct a simple end-to-end communication
algorithm by operating the slide from the
sender (S) to the receiver (R) and combining it with an
idea of [AFWZ88, AAF+]. Whenever S wishes to send
a data item to R it sends consecutively 2·Cap + 1 duplicates
of the data item to R using the slide. To receive
the first data item R waits for Cap + 1 data items and
outputs one of them. For each subsequent data item it
waits for 2·Cap + 1 data items, takes the majority of the
values received, and outputs this value. Since no more
than Cap data items can be delayed in the network at
any given time, the majority of the data items received
represent the next data item (see Theorem 18 in the
appendix).

The  sender’s  algorithm,  described  below  in  Figure  2, 
interfaces  with  the  sender’s  algorithm  of  the  slide  pro¬ 
tocol  in  a  way  that  each  token  sent  here  is  input  by  the 
slide  sender,  and  each  token  output  by  the  slide  receiver 
is  received  by  the  receiver  of  the  majority  algorithm. 

The bit communication complexity of this algorithm
is O(n^2 mD), where D is the size of the data item.
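
The majority rule can be tried out against a simple adversarial channel. The following sketch is illustrative only: it replaces the slide by an abstract channel that may reorder messages and withhold at most cap of them at any time, which is the only property of the slide the argument uses. The names majority_transfer, in_flight and window are assumptions.

```python
import random
from collections import Counter

def majority_transfer(items, cap, seed=0):
    """Send items over a reordering channel that withholds at most `cap` messages."""
    rng = random.Random(seed)
    in_flight = []        # messages currently delayed inside the channel
    window = []           # the receiver's items-set (a multiset)
    output = []
    first_item = True

    def receive(value):
        nonlocal first_item
        window.append(value)
        if first_item and len(window) == cap + 1:
            output.append(window[0])                  # any copy will do
            window.clear()
            first_item = False
        elif not first_item and len(window) == 2 * cap + 1:
            output.append(Counter(window).most_common(1)[0][0])
            window.clear()

    for item in items:
        for _ in range(2 * cap + 1):                  # sender: 2*Cap + 1 duplicates
            in_flight.append(item)
            if len(in_flight) > cap:                  # channel full: release a random message
                receive(in_flight.pop(rng.randrange(len(in_flight))))
    return output
```

With this channel, at most cap stale copies can appear in any window of 2·cap + 1 received values, so the most common value in the window is the next data item.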

We  remark  that  by  adding  a  toggle  bit  to  each  data 
item  the  protocol  can  be  made  robust  against  leftover 


Initialize →
    items-set := ∅;
    first_item := true;
□
Receive(data-item) →
    items-set := items-set ∪ data-item;
    call check_and_output;
□

a: receiver's code


true →
    data-item := next input;
    for i := 1 to 2·Cap + 1 do
        Send(data-item);
    od
□

b: sender's code


Procedure check_and_output
    if first_item = true and |items-set| = Cap + 1 then
        /* first data item */
        output(any_data_item_of(items-set));
        items-set := ∅;
        first_item := false;
    else if first_item = false and |items-set| = 2·Cap + 1
    then /* all other data items */
        output(majority(items-set));
        items-set := ∅;
    endif
    endif

c: procedure check_and_output

Figure 2: The Majority Algorithm
data items that can be in the network before the algorithm
has started.

3.3  The  Data  Dispersal  Algorithm 

For the cases where the data items are large with respect
to the size of the network, i.e. having size of Ω(nm log n)
bits, we construct an algorithm that achieves linear
(O(n)) bit communication complexity. The same algorithm
can also be used for smaller data items if the
sender is allowed to wait for more than one data item
and transmit several data items together.

In  order  to  derive  this  algorithm,  we  combine  Rabin’s 
Information  Dispersal  Algorithm  [Rab89]  with  the  fact 
that  the  slide  can  delay  only  a  bounded  number  of  pack¬ 
ets.  Using  the  Information  Dispersal  Algorithm  (IDA), 
the  sender  splits  the  data  item  into  packets  and  sends 
them  to  the  receiver  over  the  slide.  Since  the  IDA  allows 
the  construction  of  the  data  item  from  a  subset  of  the 
packets,  we  can  tolerate  the  loss  of  the  bounded  number 
of  packets  that  the  slide  can  hold. 

The sender creates using the IDA 2·Cap + 1 packets,
each of size O(D/(Cap + 1)), where D is the size of the
data item. The receiver is thus able to construct the full
data item from only Cap + 1 packets. The sender sends
the 2·Cap + 1 packets, and as the slide can delay at most
Cap of them the receiver will receive enough packets
to reconstruct the data item. The only problem left is
to make sure that the receiver does not use old delayed
packets to reconstruct subsequently sent data items. To
alleviate this problem the sender adds to the packets
of each data item a label. The receiver outputs the
first data item after calculating it from the first Cap + 1
packets it receives. For each subsequent data item it
waits for 2·Cap + 1 packets, checks which label has the
majority among the labels in the packets, and uses only
the packets having this label. For each new data item
the sender must use a label that is not present in the
network. Therefore the receiver sends back to the sender
every packet it receives through another slide operated
in the opposite direction. Thus the sender always knows
which labels are present in the network. As the capacity
of each slide is bounded by Cap, 2·Cap + 1 different labels
suffice.
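
The dispersal step can be illustrated with a small threshold code. The sketch below is not Rabin's exact construction from [Rab89]; it is a simplified stand-in that assumes arithmetic over the prime field GF(257) and a Vandermonde encoding, with hypothetical names disperse, reconstruct and _invert. It shows the property the algorithm relies on: a data item is split into n = 2·Cap + 1 pieces, each about 1/m of the original size, and any m = Cap + 1 of them suffice to rebuild it. It requires Python 3.8 or later for the modular inverse via pow.

```python
P = 257  # a prime larger than any byte value (assumption of this sketch)

def disperse(data: bytes, m: int, n: int) -> dict:
    """Split data into n pieces (keyed 1..n); any m of them determine the data."""
    padded = data + bytes(-len(data) % m)
    pieces = {x: [] for x in range(1, n + 1)}
    for off in range(0, len(padded), m):
        block = padded[off:off + m]          # m bytes = coefficients of a polynomial
        for x in pieces:                     # piece x stores the polynomial's value at x
            pieces[x].append(sum(c * pow(x, k, P) for k, c in enumerate(block)) % P)
    return pieces

def _invert(mat):
    """Invert a square matrix modulo P by Gauss-Jordan elimination."""
    m = len(mat)
    aug = [row[:] + [int(i == j) for j in range(m)] for i, row in enumerate(mat)]
    for col in range(m):
        piv = next(r for r in range(col, m) if aug[r][col] % P)
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], -1, P)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(m):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(a - f * b) % P for a, b in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]

def reconstruct(subset: dict, m: int, length: int) -> bytes:
    """Rebuild the first `length` bytes of the data from any m pieces."""
    xs = sorted(subset)[:m]
    inv = _invert([[pow(x, k, P) for k in range(m)] for x in xs])
    out = bytearray()
    for b in range(len(subset[xs[0]])):
        ys = [subset[x][b] for x in xs]
        out.extend(sum(inv[k][j] * ys[j] for j in range(m)) % P for k in range(m))
    return bytes(out[:length])
```

For Cap = 2, disperse(data, 3, 5) produces five pieces and reconstruct accepts any three of them, for instance the pieces numbered 2, 4 and 5.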

In the code, presented in Figure 3, we use the subscripts
S→R and R→S to denote operations with respect
to the slide from the sender to the receiver and
the slide from the receiver to the sender, respectively.
The interface with the slide protocols is similar to the
interface for the majority algorithm. ℒ denotes a set
of 2·Cap + 1 labels. Note that the function extract
removes and returns an arbitrary member of the set it is
applied to.

The bit communication complexity of the data dispersal
algorithm is O(nD), if it is applied to data items
of size Ω(nm log n) bits. If the algorithm is applied to
smaller data items, it achieves an amortized bit communication
complexity of O(nD), by combining several
data items together.

4 Combining Static and Dynamic Algorithms

Dynamic  algorithms  that  are  designed  to  operate  cor¬ 
rectly  in  eventually  connected  dynamic  networks  usu¬ 
ally  suffer  the  drawback  that  even  if  the  network  be¬ 
comes  static  their  communication  complexity  does  not 
decrease. In this section we use a variation of the slide
protocol, called generalized slide, to present a systematic
methodology  to  combine  a  static  algorithm  with  a  dy¬ 
namic  one  into  a  single  dynamic  algorithm  whose  com¬ 
munication  complexity  matches  that  of  the  static  algo¬ 
rithm  if  the  network  topology  becomes  static  for  a  large 
enough  period  of  time,  and  operates  correctly  even  if 
the  network  topology  changes  frequently.  The  approach 
was  first  introduced  in  [AM88]  in  an  ad-hoc  manner  for 
a  specific  problem,  the  topology  update  problem. 

The essence of our method is a mechanism ensuring
that the dynamic algorithm sends no more than k messages
per topological change (i.e. its amortized communication
complexity [AAG87, AM88] is O(k) per topological
change), where k is a design parameter. In particular,
if topological changes cease, the dynamic algorithm
grinds to a halt. To ensure that progress is made
in that case (if the network topology stabilizes) we run
in parallel a static algorithm used in conjunction with
the reset procedure of [AAG87]. Since both algorithms
run in parallel, their output should be merged into a
single series of output events for the problem being considered.
The implementation of the merging mechanism
depends on the particular problem. As an example we
describe below how to merge two algorithms in the context
of the end-to-end communication problem.

4.1  Generalized  slide 

The  slide  is  generalized  to  a  distributed  object  which 
interacts  with  two  sets  of  nodes,  the  producers  and  the 
consumers.  Producers  deposit  tokens  into  the  slide  and 
consumers  may  consume  tokens,  thus  extracting  them 
from  the  slide  locally.  The  generalized  slide  has  the 
following  properties:  (1)  At  any  time,  the  total  number 
of  tokens  consumed  is  no  more  than  the  total  number  of 
tokens  produced,  (2)  the  total  number  of  tokens  stored 
in  the  slide  at  any  given  time,  called  the  capacity,  is 
bounded,  (3)  each  token  performs  at  most  n  hops  in 
the  network,  and  (4)  if  a  consumer  wishes  to  consume 
tokens  and  is  eventually  connected  to  a  producer  that 
infinitely  many  times  produces  tokens,  then  eventually 
the  consumer  will  have  a  token  to  consume.  We  achieve 
Initialization →
    L := ℒ;
    sending := false;
    missing := 0;
□
sending = false and missing < 2·Cap →
    data-item := next input;
    l := extract(L);
    using the IDA with parameters
    Cap + 1 and 2·Cap + 1,
    create packets number 1 to 2·Cap + 1;
    send_buffer := ∪_{i=1..2·Cap+1} (l, i, packet_i);
    count[l] := 0;
    sending := true;
□
sending = true →
    (l, i, packet) := extract(send_buffer);
    Send_{S→R}(l, i, packet);
    count[l] := count[l] + 1;
    missing := missing + 1;
    if |send_buffer| = 0 then sending := false; endif
□
Receive_{R→S}(l, i, packet) →
    missing := missing - 1;
    count[l] := count[l] - 1;
    if (count[l] = 0) then L := L ∪ l; endif

a: sender's code


Initialization →
    packets-set := ∅;
    first_item := true;
    packets-to-return := ∅;
□
Receive_{S→R}(l, i, packet) →
    packets-set := packets-set ∪ (l, i, packet);
    call check_and_output;
□
packets-to-return ≠ ∅ →
    (l, i, packet) := extract(packets-to-return);
    Send_{R→S}(l, i, packet);

b: receiver's code


Procedure check_and_output
    if first_item = true and
       |packets-set| = Cap + 1 then
        /* first data item */
        using the IDA calculate the data item from the
        Cap + 1 packets in packets-set;
        output(data-item);
        packets-to-return := packets-to-return ∪ packets-set;
        packets-set := ∅;
        first_item := false;
    else if first_item = false and
            |packets-set| = 2·Cap + 1 then
        /* all other data items */
        majority-label := majority-of-labels(packets-set);
        using the IDA calculate the data item
        from the packets in packets-set
        having the label 'majority-label';
        output(data-item);
        packets-to-return := packets-to-return ∪ packets-set;
        packets-set := ∅;
    endif
    endif

c: procedure check_and_output


Figure 3: The Data Dispersal Algorithm
this by requiring strong fairness in token distribution as
outlined  below. 

Briefly  stated,  given  the  slide  of  Section  3,  the  gen¬ 
eralized  slide  is  implemented  as  follows:  An  additional 
pile,  with  a  single  slot  at  height  n,  is  added  to  each  pro¬ 
ducer  node.  This  pile  is  managed  in  the  same  way  as  the 
special  pile  of  the  sender  in  the  original  slide.  Whenever 
this  pile  is  vacant,  the  producer  can  locally  add  a  token 
to the slide. A consumer node is allowed to extract from
its piles tokens for consumption. This is in addition to
its ability to send tokens to adjacent nodes. Whenever
it extracts a token from a pile it has to report this event
to the other end of the link associated with this pile, as
if the token was sent over a link. The strong fairness
property is achieved as follows: If a node can forward
tokens on a particular link infinitely often, then tokens
will be forwarded on the link infinitely often. Furthermore,
we assume that there is a conceptual link from
a slot in a pile to the slot below it, which transfers tokens
down if the slot is empty. This conceptual link is
treated as a regular link as far as fairness goes. Finally,
a token goes down only one slot when traversing any
link. More than one token can temporarily reside at the
same slot. However, since the rules for sending tokens
over a physical link remain the same, the total number
of tokens in a pile is still at most n. The mechanism
described  above  is  a  version  of  slide  that  avoids  starva¬ 
tion.  We  will  use  a  generalized  slide  where  all  nodes  are 
both  consumers  and  producers.  Note  that  the  slide  of 
Section  3  is  a  generalized  slide  with  one  producer  and 
one  consumer. 

4.2  Combining  the  Algorithms 

To  build  the  combined  algorithm  we  run  two  separate 
algorithms,  a  dynamic  one  and  a  static  one,  both  re¬ 
ceiving  inputs  and  delivering  outputs  to  a  combination 
mechanism  that  outputs  a  single  output  for  the  prob¬ 
lem  being  solved.  We  assume  that  this  mechanism  uses 
at  most  a  constant  number  of  output  events  of  the  dy¬ 
namic  and  the  static  algorithms  in  order  to  produce  one 
output  event  for  the  whole  algorithm,  and  that  it  can 
continue  to  generate  output  events  even  if  one  of  the 
two  algorithms  halts.  There  are  three  major  steps  in 
building  the  combined  algorithm: 

1. Whenever a node senses a topological change on
one of its incident links it produces a resource token.
A node running the dynamic algorithm has to
consume such a token for each message it sends.
Thus, each topological change creates at most 2 resource
tokens, one per each node adjacent to the
topological change. These two tokens can account
for at most two messages sent in the dynamic algorithm.
The generalized slide is employed to spread
around the resource tokens. If the new resource token
cannot be accepted by the slide because the
node's pile is full, the resource token is discarded.
This ties the dynamic algorithm to the topological
changes.

2.  A  reset  protocol,  [AAG87],  is  used  to  tie  the  static 
algorithm  to  the  topological  changes.  That  is,  the 
static  algorithm  should  abort  its  execution  if  topo¬ 
logical  changes  occur.  Moreover,  the  local  reset 
of  [AAG87,  AAM89]  guarantees  that  topological 
changes  in  remote  areas  of  the  network  do  not  un¬ 
necessarily  affect  the  static  algorithm. 

3.  We  now  have  two  independent  algorithms  running 
in  the  network.  The  question  still  stands  how  to 
distribute  the  inputs  among  them,  and  more  impor¬ 
tantly,  how  to  merge  their  outputs  into  one  correct 
output. The answer to this question is dependent
on the particular problem and algorithms used. We
discuss in the next section, as an example, the case
of the end-to-end communication problem.
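
The token accounting of step 1 can be sketched from the point of view of a single node. This is a deliberately local simplification: in the paper the resource tokens are spread through the network by the generalized slide, whereas here a node only spends tokens that it produced itself. The class name TokenGatedSender and its fields are hypothetical.

```python
from collections import deque

class TokenGatedSender:
    """A node of the dynamic algorithm that must spend one resource token per message."""

    def __init__(self, pile_capacity):
        self.capacity = pile_capacity   # room in the node's producer pile
        self.tokens = 0                 # resource tokens currently available
        self.queue = deque()            # messages waiting for a token
        self.sent = []                  # messages actually sent

    def on_topological_change(self):
        # A sensed change produces one token; it is discarded if the pile is full.
        if self.tokens < self.capacity:
            self.tokens += 1
        self._drain()

    def send(self, msg):
        self.queue.append(msg)
        self._drain()

    def _drain(self):
        # Each message sent consumes exactly one resource token.
        while self.queue and self.tokens:
            self.tokens -= 1
            self.sent.append(self.queue.popleft())
```

By construction the number of messages in sent never exceeds the number of topological changes the node has sensed, and the node stops sending when changes cease.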

We  first  have  to  argue  that  the  combined  algorithm  is 
correct,  that  is,  eventually  a  new  output  event  of  either 
the  static  or  the  dynamic  algorithm  will  occur:  If  the 
network  topology  stabilizes  then  the  static  algorithm 
never  aborts  and  output  events  occur.  If  the  network 
never  stabilizes,  then  eventually  nodes  running  the  dy¬ 
namic  algorithm  will  have  a  token  to  consume  and  the 
algorithm  will  make  progress. 

For  problems  where  one  can  build  an  algorithm  that 
employs  two  totally  independent  algorithms,  uses  at 
most  a  constant  number  of  outputs  of  each  one  to  pro¬ 
duce  an  output  for  the  problem,  and  can  operate  even 
if  one  of  the  two  halts,  we  have  the  following: 

Theorem 1 Given a dynamic algorithm A_d and a
static algorithm A_s with communication complexity C_s
per output event, there is an algorithm that has the following
amortized communication complexity: O(n + C_s +
m) messages per topological change plus O(C_s) per output
event, where n and m are the number of nodes and
links in the network, respectively.

Proof Sketch. All messages sent by the combined algorithm
are sent by one of the following four components:
the dynamic algorithm A_d, the static algorithm
A_s, the generalized slide and the reset mechanism of
[AAG87].

The dynamic algorithm consumes a resource token to
send each message, therefore sending at most O(1) messages
per topological change. Each token of the generalized
slide makes at most n hops in the network, thus
the generalized slide has O(n) amortized message complexity
per topological change. The amortized communication
complexity of the reset mechanism [AAG87] is
O(m) per topological change. For each C_s messages of
the static algorithm either at least one output event of
A_s occurs, or the output is purged because of a reset
initiated by a topological change. Since after some constant
number of output events of A_s there is an output
event of the combined algorithm, any series of at most
O(C_s) messages of A_s can be charged to either an output
event of the combined algorithm or to a topological
change. □

4.3  Combined  end-to-end 

In  this  subsection,  we  outline  the  application  of  the  com¬ 
bining  methodology  to  create  a  dynamic  end-to-end  pro¬ 
tocol  that  diverges  into  the  static  algorithm  if  the  net¬ 
work  is  static  for  long  stretches  of  time. 

The problem of end-to-end communication from a sender to a
receiver can be reduced to the problem of implementing two probes
between these two processors, one in each direction (a probe
implements the read of a remote variable); see [AG88]. Moreover,
that paper shows how two independent probes in the same direction
can be merged into a single probe.

We  use  the  combining  methodology  to  solve  the  end- 
to-end  problem  as  follows:  We  apply  the  methodology 
to  static  and  dynamic  probe  algorithms,  dispersing  the 
inputs  and  merging  the  outputs  as  in  [AG88],  resulting 
in  a  combined  probe  (in  one  direction)  from  the  sender 
to  the  receiver.  Similarly,  a  corresponding  combined 
probe  is  constructed  in  the  opposite  direction.  The  re¬ 
sult  is  a  pair  of  efficient  probes,  one  in  each  direction, 
as  is  required  (by  the  reduction  of  end-to-end  to  probe). 

The static probe algorithm used is the obvious one: broadcast and
echo along a fixed path. The dynamic algorithm is a flood forward
and a flood back, in conjunction with the bootstrap mechanism of
[AG91]. The end result is an end-to-end protocol whose amortized
communication complexity is $O(n + m)$ messages per topological
change plus $O(n)$ per data item. In particular, when the network
topology stabilizes it achieves $O(n)$ message complexity.

Lemma 10 In any time interval $[t_0, t_1]$ in which $\mathit{new}$
new tokens are added to the network, at most
$O(n^2 m + \mathit{new} \cdot n)$ token passes can occur.

Proof: By Lemma 9 the total number of tokens in the network at
$t_0$ is bounded by $O(nm)$. By Lemma 6 each can make up to n hops
in the network, thus contributing up to $O(n^2 m)$ token passes.
Any new token added in the time interval $[t_0, t_1]$ can also make
up to n hops. □

Theorem 11 If S and R are eventually connected, S will eventually
input a new token.

Proof: By way of contradiction assume that t is the last time the
sender inputs a token.

As a result of Lemma 10, and as there is only one SIGNAL message
per token pass, there is a time t' > t after which no token or
SIGNAL message is sent. As S and R are eventually connected there
is a path $R = v_0, v_1, \ldots, v_{k-1}, v_k = S$, $k < n$, such
that for each i the link $(v_i, v_{i+1})$ is viable; hence there is
a time t'' > t' by which all messages between $v_i$ and $v_{i+1}$
have been delivered.

By induction on the length of the viable path from $v_i$ to R, we
can show that $v_i$ cannot have a token in level > i after time
t''.

The receiver, $v_0$, has no tokens stored at all. Denote by e the
link between $v_i$ and $v_{i-1}$ ($i \ge 1$), and assume the
induction assumption that $v_{i-1}$ has no token stored at level
> i − 1. Since at t'' all messages between $v_{i-1}$ and $v_i$ have
arrived, by Lemma 4 the value of bound[e] exceeds by 1 a quantity
that, by the induction assumption, is at most i − 1; thus
bound[e] ≤ i. As t'' > t', no token is sent after t'', but according
to the code this can happen only if $v_i$ does not have tokens of
level i + 1 or more, proving the induction step.

Since k < n, S does not have a token of level n at t'', and by the
code will enable input of a new token, contradicting the
assumption that t is the last time S inputs a token. □

A.2 Complexity

Lemma 12 The number of messages sent by slide in any time
interval where $\mathit{new}$ new tokens are input by the sender is
bounded by $O(n^2 m + \mathit{new} \cdot n)$.

Corollary 13 The number of bits sent in the slide protocol in any
time interval where $\mathit{new}$ new tokens are input by the
sender is bounded by $O((n^2 m + \mathit{new} \cdot n)L)$, where L
is the maximal number of bits in a token.

Lemma  14  The  space  needed  at  each  node  is  O(nL) 
per  incident  link,  where  L  is  the  maximal  number  of 
bits  in  a  token. 

B  The  Majority  Algorithm 

B.1 Correctness Proof

Definition 15 $\mathit{in}^{[t,t']}$ is the number of tokens
deposited by the sender into the slide in the time interval
$[t, t']$.

Definition 16 $\mathit{out}^{[t,t']}$ is the number of tokens
received by the receiver from the slide in the time interval
$[t, t']$.

Definition 17 $\mathit{delay}^{t}$ is the number of tokens delayed
by the slide at time t.


Theorem 18 (Safety) At any time the output of the receiver is a
prefix of the input of the sender.

Proof: We denote by $I = \{I_1, I_2, \ldots\}$ and by
$O = \{O_1, O_2, \ldots\}$ the input to the sender and the output
of the receiver, respectively. Denote by $t_0$ some time before
the beginning of the execution of the algorithm, and by $t_i$,
$i > 0$, the time at which $O_i$ is output.

To prove the theorem, we claim that the majority of the tokens
received by the receiver in the interval of time $[t_{i-1}, t_i]$
carry data item $I_i$. First we show that no token carrying $I_k$,
$k > i$, could have been received before $t_i$. By the code, the
total number of tokens that have been received by the receiver
until time $t_i$ is:

$\mathit{Cap} + 1 + (i - 1)(2 \cdot \mathit{Cap} + 1)$.

Since the network capacity is Cap, the total number of tokens
sent by the sender at any time t is at most Cap more than the
total received by the receiver at the same time t. Thus,

$\mathit{in}^{[t_0, t_i]} \le i(2 \cdot \mathit{Cap} + 1)$    (1)

Therefore, no token carrying $I_k$, $k > i$, can be sent by the
sender before $t_i$. Hence, no such token can be received by the
receiver before $t_i$.

We claim that no more than Cap tokens containing data item $I_k$,
$k < i$, may be received in the interval of time $[t_{i-1}, t_i]$.
This completes the proof of the safety property because together
with the above it implies that from the $2 \cdot \mathit{Cap} + 1$
tokens received in $[t_{i-1}, t_i]$ at least $\mathit{Cap} + 1$ of
them carry data item $I_i$.

To prove the claim we distinguish between two sets of tokens,
those that carry data items $I_k$, $k < i$, which we call old, and
all the other tokens. We have already proved that all the tokens
received until $t_{i-1}$ are old and that the total number of such
tokens received by the receiver until $t_{i-1}$ is
$(2 \cdot \mathit{Cap} + 1)(i - 1) - \mathit{Cap}$. Since the total
number of old tokens ever sent by the sender is
$(2 \cdot \mathit{Cap} + 1)(i - 1)$, at most Cap may be received by
the receiver in the interval of time $[t_{i-1}, t_i]$. □
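
The counting in this safety argument can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the sender is assumed to emit exactly 2·Cap + 1 tokens per data item, and the cumulative number of tokens received by time t_i (for i ≥ 1) is taken to be i(2·Cap + 1) − Cap, the form implied by the bound on old tokens above.

```python
def received_until(i: int, cap: int) -> int:
    """Tokens received by the receiver up to t_i (assumed closed form).

    t_0 precedes the execution, so nothing has been received by then.
    """
    if i == 0:
        return 0
    return i * (2 * cap + 1) - cap


def max_sent_until(i: int, cap: int) -> int:
    """The network holds at most `cap` tokens, so the sender can be
    at most `cap` tokens ahead of the receiver."""
    return received_until(i, cap) + cap


def min_current_tokens(i: int, cap: int) -> int:
    """Lower bound on the tokens carrying item i among those received
    in the interval (t_{i-1}, t_i]."""
    window = received_until(i, cap) - received_until(i - 1, cap)
    old_sent = (i - 1) * (2 * cap + 1)      # all tokens carrying items < i
    old_in_flight = old_sent - received_until(i - 1, cap)
    return window - old_in_flight
```

Under these assumptions the sender has emitted at most i(2·Cap + 1) tokens by t_i, which is inequality (1), and at least Cap + 1 tokens in each window carry the current item.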

Theorem  19  (Liveness)  If  the  sender  and  the  re¬ 
ceiver  are  eventually  connected,  then  the  receiver  will 
eventually  output  any  data  item  input  by  the  sender. 

B.2  Complexity 

Lemma 20 The message complexity of the majority algorithm is
$O(n^2 m)$.

Proof: Let $T_i$ be the interval of time from the time $O_{i-1}$ is
output to the time $O_i$ is output. Clearly in $T_i$ the receiver
receives $2 \cdot \mathit{Cap} + 1$ tokens. Since the slide can hold
at most Cap tokens, at most $3 \cdot \mathit{Cap} + 1$ tokens are
sent by the sender in $T_i$, and the lemma follows from Lemma 12,
since $\mathit{Cap} = O(nm)$ by Lemma 9. □

Corollary 21 The bit communication complexity of the majority
algorithm is $O(n^2 m D)$, where D is the size in bits of a data
item.

Lemma 22 The space complexity of the majority algorithm is
$O(nD)$, where D is the size in bits of a data item.

Proof: Each token sent in the majority algorithm consists of
$O(D)$ bits. Applying this to Lemma 14, we get the space
complexity of $O(nD)$ bits. □

C The Data Dispersal Algorithm

C.1 Correctness Proof

Lemma  23  Whenever  the  sender  tries  to  extract  a  label 
from  C,  C  is  not  empty. 

Theorem  24  (Safety)  At  any  time  the  output  of  the 
receiver  is  a  prefix  of  the  input  of  the  sender. 

Proof: We denote by $I = \{I_1, I_2, \ldots\}$ and by
$O = \{O_1, O_2, \ldots\}$ the input to the sender and the output
of the receiver, respectively. Let $t_i$ be the time when the
receiver outputs the i'th data item, and denote by $l_i$ the label
added to the $2 \cdot \mathit{Cap} + 1$ packets calculated from
$I_i$ at the sender. By the code, the tokens used at $t_i$ to
calculate the i'th data item at the receiver are the
$2 \cdot \mathit{Cap} + 1$ tokens received by it in the time
interval $[t_{i-1}, t_i]$. By the same arguments used for the
safety proof of the majority algorithm (Theorem 18), at least
$\mathit{Cap} + 1$ of these tokens contain the label $l_i$; thus
the majority of labels will be $l_i$ and the receiver will
calculate the i'th data item from the tokens containing $l_i$.
Since at the time the sender extracts $l_i$ from C there is no
token containing it in the network, the IDA at the receiver at
$t_i$ will use only packets calculated from $I_i$ at the sender.
As noted before, the receiver has at least $\mathit{Cap} + 1$ such
packets at $t_i$, and the IDA will correctly calculate $I_i$ at
$t_i$. Thus $O_i = I_i$ for any i. □

Lemma 25 For any time t, $\mathit{missing} \le 4 \cdot \mathit{Cap} + 1$.

Proof: The variable missing is incremented when a token is
extracted from send_buffer. The only event where tokens are added
to send_buffer is when $2 \cdot \mathit{Cap} + 1$ tokens are added
to it when it is empty and
$\mathit{missing} \le 2 \cdot \mathit{Cap}$. Thus at any time
$\mathit{missing} \le 4 \cdot \mathit{Cap} + 1$. □

Note that this also implies that the receiver never stores more
than $4 \cdot \mathit{Cap} + 1$ tokens in all its buffers.

Lemma 26 If the sender and the receiver are eventually connected,
there is no deadlock at the sender (eventually
$\mathit{missing} \le 2 \cdot \mathit{Cap}$).


Theorem  27  (Liveness)  If  the  sender  and  the  re¬ 
ceiver  are  eventually  connected,  then  the  receiver  will 
eventually  output  any  data  item  input  by  the  sender. 

C.2  Complexity 

Lemma 28 The message complexity of the data dispersal algorithm
is $O(n^2 m)$ messages.

Proof: Denote by $t_i$ the time the i'th data item is output at
the receiver. We use for the slide from the sender to the
receiver the same notation as in Section B.1. By the code
$\mathit{out}^{[t_{i-1}, t_i]} = 2 \cdot \mathit{Cap} + 1$, and
$0 \le \mathit{delay}^{t} \le \mathit{Cap}$, thus
$\mathit{Cap} + 1 \le \mathit{in}^{[t_{i-1}, t_i]} \le 3 \cdot \mathit{Cap} + 1$.
By Lemma 9 $\mathit{Cap} = O(nm)$, and applying this to Lemma 12
yields a message complexity of $O(n^2 m)$ for the slide from the
sender to the receiver.

The tokens that are sent through the slide from the receiver to
the sender in the time interval $[t_i, t_{i+1}]$ must be in
tokens_to_return just after $t_i$, since new tokens are added to
this set only at output events at the receiver. By Lemma 25 the
receiver stores at any time at most $4 \cdot \mathit{Cap} + 1$
tokens. In the worst case all of them are in tokens_to_return at
$t_i$. Thus at most $4 \cdot \mathit{Cap} + 1$ tokens are input to
this slide in the time interval $[t_i, t_{i+1}]$. Applying this to
the results of Lemmas 9 and 12 we obtain a message complexity of
$O(n^2 m)$ for this slide.

Combining the two slide protocols we obtain a message complexity
of $O(n^2 m)$ for the data dispersal algorithm. □

Corollary 29 The bit communication complexity of the data
dispersal algorithm is $O(nD)$, where D is the size in bits of a
data item.

Proof: Each token sent in the data dispersal algorithm consists
of a packet of size $O(D/\mathit{Cap})$, a label of size
$O(\log n)$, and a serial number of size $O(\log n)$. The message
complexity of the algorithm is $O(n^2 m)$, and half of the
messages are of size $O(D/\mathit{Cap} + \log n)$, while the other
half have constant size. The total number of bits sent between
any two consecutive output events at the receiver is, therefore,
$O((n^2 m)(D/\mathit{Cap} + \log n))$. Since
$\mathit{Cap} = O(nm)$ this is $O(nD + n^2 m \log n)$. For the data
dispersal algorithm the data item is of size
$\Omega(nm \log n)$, therefore the bit complexity is $O(nD)$. □

Lemma 30 The space complexity of the data dispersal algorithm is
$O(D/m + n \log n)$, where D is the size in bits of a data item.

Proof: Each token sent in the algorithm is of size
$O(D/\mathit{Cap} + \log n)$. By applying Lemma 14 we get the space
complexity of $O(n(D/\mathit{Cap} + \log n))$. Since
$\mathit{Cap} = O(nm)$ the space complexity is
$O(D/m + n \log n)$. □

Computing with Faulty Shared Memory

(Extended  Abstract) 


Yehuda  Afek*  David  S.  Greenberg^ 

Abstract.  This  paper  addresses  problems  which  arise  in 
the  synchronization  and  coordination  of  distributed  systems 
which  employ  unreliable  shared  memory.  We  present  algo¬ 
rithms  which  solve  the  consensus  problem,  and  which  sim¬ 
ulate  reliable  shared-memory  objects,  despite  the  fact  that 
the  available  memory  objects  (e.g.  read/write  registers,  test- 
and-set  registers,  read-modify-write  registers)  may  be  faulty. 

1  Introduction 

Research on fault-tolerant, shared-memory systems typically
studies processor failures and assumes that the
shared  memory  is  reliable  [Lam86,  Her91,  BP87,  VA86, 
Blo87,  SAG87,  CIL87,  Abr88,  PI088,  Rab82,  AAD+90]. 
However,  memories  do  fail.  For  example,  a  shared  reg¬ 
ister  might,  after  failing,  return  some  arbitrary  value  to 
subsequent  read  operations.  This  paper  investigates  the 
effects  of  shared  memory  failures  in  distributed  systems. 

One  common  technique  for  dealing  with  faulty  mem¬ 
ory  is  to  keep  many  copies  of  each  datum  or,  in  general, 
to  use  some  type  of  redundant  coding  which  allows  er¬ 
rors  to  be  detected  and/or  corrected.  But  if  the  redun¬ 
dancy  is  included  in  the  value  of  a  single  register,  then 
an  arbitrary  spontaneous  change  of  the  entire  contents 
of  the  register  will  not  be  detected.  Additionally,  the 
increased  size  of  the  register  makes  it  more  difficult  to 
build  -  and  perhaps  more  prone  to  failure. 

In  a  shared  memory  environment  these  techniques  are 
less  useful.  A  shared  memory  for  a  distributed  system 
is  much  more  complex  to  implement  than  the  memory 
of  a  uni-processor.  Common  primitives  such  as  read- 
modify-write  or  even  simple  reads  and  writes  require  a 
memory which is much more than a passive repository of
data (see, e.g., [Smi82]). If redundancy is spread across
registers  then  the  encoding  is  subject  to  the  timing  in¬ 
consistencies  which  are  the  bane  of  all  distributed  algo¬ 
rithms.  Strategies  for  rectifying  shared  memory  faults 
will  have  to  incorporate  distributed  coordination  tech¬ 
niques.  Hence,  the  need  arises  to  understand  the  power 
of memories, some of whose cells may be faulty. Algorithms
resistant to memory faults can be used directly
to create reliable applications. Alternatively, a
reliable  shared  memory  can  be  implemented  from  un¬ 
reliable  memories,  thereby  providing  reliable  primitives 
to  applications  which  cannot  tolerate  faulty  memory. 

We  show  that  both  these  approaches  are  viable.  Af¬ 
ter  a  discussion  of  memory  failures  and  specification  of 
faulty  memory  primitives  (in  Section  2),  the  body  of  this 
paper  presents  a  sequence  of  algorithms  that  use  faulty 
shared-memory.  Section  3  begins  with  algorithms  for 
implementing reliable read/write memories from faulty
memories.  Following  Lamport  [Lam86],  we  present  sim¬ 
ple  constructions  of  safe  and  regular  registers  and  con¬ 
clude  with  a  construction  of  a  reliable  atomic  register 
from  faulty  atomic  registers. 

We  turn  next  to  studying  the  consensus  problem 
in  shared  memory  models,  using  unreliable  versions  of 
the  more  powerful  primitives  test-and-set  (Section  4) 
and  read-modify-write  (Section  5).  The  algorithms  we 
present demonstrate that faults do not qualitatively decrease
the power of these primitives, in that they retain
their positions in the memory hierarchy of Herlihy [Her91].
Moreover, in combination with the earlier
register  results,  our  consensus  algorithms  can  be  used  to 
implement  the  universal  construction  of  Herlihy  [Her91]. 
Hence,  for  example,  faulty  read-modify-write  primitives 
can  be  used  to  implement  any  shared  object.  The  pa¬ 
per  closes  with  a  discussion  of  related  work  and  open 
problems. 

2  The  model 

Perhaps  the  most  general  memory  fault  is  a  spontaneous 
change  in  the  value  of  a  register.  Certainly  if  every  regis¬ 
ter  can  spontaneously  change  to  a  new  value  at  any  time 
then  the  situation  is  hopeless.  Our  goal  is  to  explore  how 
much  of  the  memory  can  be  faulty  while  still  allowing 
a  specific  problem  to  be  solved,  or  a  fault-free  memory 
to be simulated. We will therefore restrict the number
of memory objects which can change and/or the number
of times they can change. Other less general types
of  faulty  behaviors  might  include  registers  which  never 
change  their  value  (after  becoming  faulty  all  reads  re¬ 
turn  the  same  value  regardless  of  any  write  command), 
registers  which  occasionally  miss  a  write  (the  written 
value  is  never  available  to  a  read),  and  registers  which 
occasionally  return  the  wrong  value  (some  reads  return 
arbitrary  values  or  return  values  which  are  not  consis¬ 
tent  with  any  ordering  of  the  writes).  Alternatively,  the 
timing  of  the  faults  might  be  restricted;  for  example,  all 
memory  faults  might  occur  before  any  processor  takes 
a  step. 

2.1  Specifying  faulty  memory 

We  consider  a  collection  of  asynchronous  processors 
which  communicate  via  a  shared  memory.  The  shared- 
memory  may  consist  of  a  variety  of  shared  data  ob¬ 
jects.  This  shared  memory  is  subject  to  memory  fail¬ 
ures,  each  modeled  as  a  write  to  a  shared  data  object 
which  is  atomic  with  respect  to  the  processors’  opera¬ 
tions  on  that  object.  (The  local,  non-shared  memory 
for  each  processor  is  assumed  to  be  reliable.  Error- 
correcting  codes  and  similar  redundant  techniques  may 
be  employed  in  local  memories,  where  issues  of  commu¬ 
nication,  synchronization  and  processor  failures  are  less 
critical.) 

Faulty variants of atomic data objects are specified as follows.
Each reliable object X has a type, which defines a set of
possible values; a set of primitive operations, which provide the
only means to manipulate that object; and a set of runs, which
constitute the sequential specification of X and define how the
object behaves when its operations are invoked one at a time
[HW90]. A faulty object, X, extends the set of operations with a
set of write(v) operations (for any v which is ever a legal value
of X), invocations of which constitute failures of X. The
sequential specification of the faulty object now includes
failures of X. Provided serial operations on a reliable X are
total, the sequential specification of a faulty object is a
function of that of the reliable object, in the obvious way. (If
serial operations on a reliable X are not defined on some values,
a definition of the faulty serial specification, specific to X,
may be required.) Given the sequential specification, a complete
concurrent  specification  may  be  derived  using  any  of  a 
number  of  techniques  or  styles  [Lam86,  HW90,  VA86, 
AAD+90,  BP87,  Blo87,  Her91,  LT87]. 

Some  widely-studied  objects  are  not  atomic — the  reg¬ 
ular  and  safe  registers  defined  by  Lamport  are  the  best- 
known  examples  [Lam86].  In  these  particular  instances, 
write  operations  are  already  defined  on  the  objects,  and 
it  is  straightforward  to  extend  their  specification  to  in¬ 
clude  faulty  write  operations  in  addition  to  those  in¬ 
voked  by  processors. 


As observed above, some constraints must be imposed
on the occurrence of memory faults:

•  We  use  m  to  denote  the  total  number  of  memory 
failures  (faulty  write  operations)  in  a  run  (or  collec¬ 
tion  of  runs)  of  a  system. 

• We use f to denote the total number of data objects
in a system that may be affected by memory
failures in a run (or collection of runs) of a system.

• A data object is k-faulty if it suffers at most k failures,
i.e., at most k faulty write operations to it are
invoked during each run. A data object is ∞-faulty
if there is no finite bound on the number of failures
it suffers.
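
The fault parameters above can be made concrete with a small model. This is a minimal sketch, not part of the paper: the class and function names are hypothetical, the model is purely sequential, and `affected_objects` counts the objects actually hit in one simulated run, whereas f in the text bounds the objects that may be hit.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FaultyRegister:
    """A register whose failures are modeled as extra write operations."""
    value: int = 0
    failures: int = 0          # faulty writes suffered so far in this run

    def read(self) -> int:
        return self.value

    def write(self, v: int) -> None:
        """Ordinary write invoked by a processor."""
        self.value = v

    def fault_write(self, v: int) -> None:
        """A memory failure: the register spontaneously takes value v."""
        self.value = v
        self.failures += 1

    def within_k_faults(self, k: float) -> bool:
        """True if this run so far is consistent with the register being k-faulty."""
        return self.failures <= k


def total_failures(regs: Iterable[FaultyRegister]) -> int:
    """The quantity m: faulty writes over all registers in the run."""
    return sum(r.failures for r in regs)


def affected_objects(regs: Iterable[FaultyRegister]) -> int:
    """Number of registers that suffered at least one failure in the run."""
    return sum(1 for r in regs if r.failures > 0)
```

Passing `float("inf")` as k models an ∞-faulty register, for which no finite bound applies.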

In  algorithms  working  with  faulty  shared  memory,  we 
generally  require  the  algorithms  to  be  strongly-wait-free; 
any  operation  by  a  processor  must  terminate  its  execu¬ 
tion,  regardless  of  the  number  of  shared  memory  faults 
and  independent  of  the  steps  taken  by  other  processors. 
Thus,  a  strongly-wait-free  algorithm  may  correctly  im¬ 
plement  a  shared  object  for  only  a  bounded  number  of 
memory  faults,  but  each  high-level  operation  by  a  non- 
faulty  processor  must  still  terminate,  even  if  the  bound 
on  the  number  of  memory  faults  is  exceeded  in  a  given 
run. 

In  order  to  quantify  the  cost  of  having  a  certain  num¬ 
ber  and  type  of  memory  faults,  we  define  a  function, 
CONS,  which  represents  the  number  of  copies  of  one 
object,  some  of  which  are  faulty,  that  are  necessary  to 
construct  another  object: 

Definition

$\mathrm{CONS}(X, m, Y)$ = the number of objects of type
X required to construct one non-faulty object of
type Y, assuming there may be at most m memory
faults among the type X objects.

$\mathrm{CONS}_k(X, f, Y)$ = the number of objects of type
X required to construct one non-faulty object of
type Y, assuming at most f of the type X objects
may be k-faulty (k can be ∞).

2.2  A  general  construction 

We start with a simple theorem in which we show how a solution
that tolerates one faulty register can be improved to tolerate f
faulty registers.¹

Theorem 1 For any $f \ge 1$,
$\mathrm{CONS}_\infty(X, f, X) \le (2f)^{\log \mathrm{CONS}_\infty(X,1,X)} = (\mathrm{CONS}_\infty(X,1,X))^{1+\log f}$.

¹ Throughout, logarithms are base 2 unless otherwise noted.

Proof: Assume there is a construction of a reliable object of
type X, using $C = \mathrm{CONS}_\infty(X, 1, X)$ strongly-wait-free
objects of type X, one of which may be ∞-faulty.

The goal is to construct a strongly-wait-free object X that is
reliable, even if f of the component registers are ∞-faulty. That
is, to construct a reliable object X from a set of objects X, f of
which might be faulty. Consider the following recursive
construction: we first construct C strongly-wait-free objects of
type X, each of which is resilient to $\lfloor f/2 \rfloor$ faulty
registers, then we use these C objects to construct a single
object X, using the single-fault construction. The result is an
object that can tolerate the failure of one of the embedded
$\lfloor f/2 \rfloor$-memory-resilient objects. For one of the
embedded objects to fail, there must be at least
$\lceil f/2 \rceil$ memory faults in it. Since the total number of
faults is $f = \lceil f/2 \rceil + \lfloor f/2 \rfloor$, none of the
other embedded objects is faulty. Hence, the final construction is
tolerant of at least f memory faults. The total number of X
objects used in this construction is:

$\mathrm{CONS}_\infty(X, f, X)$
$= \mathrm{CONS}_\infty(X, 1, X) \cdot \mathrm{CONS}_\infty(X, \lfloor f/2 \rfloor, X)$
$\le \mathrm{CONS}_\infty(X, 1, X)^{\lfloor \log f \rfloor + 1}$
$\le \mathrm{CONS}_\infty(X, 1, X)^{\log f + 1}$
$= (2f)^{\log \mathrm{CONS}_\infty(X,1,X)}$. ∎
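
The recursion in this proof is easy to evaluate directly. The following sketch is illustrative only: the function names are hypothetical, `c` stands for the size C of the assumed single-fault construction, and the base case (one object when no faults need to be tolerated) is an assumption made so that the recursion yields C for f = 1.

```python
import math


def objects_needed(f: int, c: int) -> int:
    """Base objects used by the recursive construction tolerating f faults,
    given a single-fault construction that uses c objects."""
    if f == 0:
        return 1                       # assumed: a bare object, no redundancy
    # c embedded objects, each itself built to tolerate floor(f/2) faults
    return c * objects_needed(f // 2, c)


def closed_form_bound(f: int, c: int) -> float:
    """The bound (2f)^(log2 c), which equals c^(1 + log2 f)."""
    return (2 * f) ** math.log2(c)
```

For example, with c = 3 the recursion uses 3 objects for f = 1 and 27 for f = 4; the count meets the closed-form bound exactly when f is a power of two and stays below it otherwise.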

A  similar  recursive  construction  can  be  based  on  any 
self-construction  resilient  to  c  >  1  register  faults.  More¬ 
over,  the  same  construction  works  for  weaker  types  of 
failure: 

Theorem  2 

For any f and c, f > c > 0, and for all k ∈ {1, ...} ∪ {∞},

COTiSk(X,f,X)  < 

CONS(A,m,A’)  <  ((c-t-l)m)‘'’*«=+» 

Fault-tolerant  constructions  can  be  composed  with 
fault-intolerant  constructions: 

Theorem 3

$\mathrm{CONS}_k(X, f, Z) \le \mathrm{CONS}_k(X, f, Y) \cdot \mathrm{CONS}_k(Y, 0, Z)$;
$\mathrm{CONS}(X, m, Z) \le \mathrm{CONS}(X, m, Y) \cdot \mathrm{CONS}(Y, 0, Z)$.

3 Read/Write registers

One approach to tolerating faulty registers is to add a software
layer between the faulty hardware and the user which looks to the
user like fault-free hardware. In this section we show that this
is possible for safe, regular, and atomic registers. That is, we
present constructions of safe, regular, and atomic registers from
a collection of the corresponding primitives, f of which may be
∞-faulty.


Theorem 4 One reliable, strongly-wait-free, safe register can be
constructed from 2f + 1 similar registers, f of which may be
∞-faulty: $\mathrm{CONS}_\infty(\mathit{safe}, f, \mathit{safe}) = 2f + 1$.

Proof: For the upper bound, the obvious construction works: the
writer writes the 2f + 1 registers, and the reader reads them. If
the reader sees a majority value (f + 1 with the same value), it
returns that value, otherwise it returns any value. The lower
bound is similar to the proof of Theorem 11. (Note that this
result holds for multi-reader/multi-writer safe registers. In
what follows, registers are assumed to be
single-reader/single-writer.) ∎
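
The upper-bound construction in this proof can be sketched as follows. This is a sequential illustration only, with hypothetical names; it does not model concurrent reads and writes, and the `corrupt` method is an assumed stand-in for a memory fault on one component register.

```python
from collections import Counter
from typing import List


class ReplicatedSafeRegister:
    """A safe register built from 2f + 1 component registers."""

    def __init__(self, f: int, initial: int = 0) -> None:
        self.f = f
        self.cells: List[int] = [initial] * (2 * f + 1)

    def write(self, v: int) -> None:
        # The writer writes every component register.
        for i in range(len(self.cells)):
            self.cells[i] = v

    def corrupt(self, index: int, v: int) -> None:
        # Assumed fault model: one component spontaneously changes value.
        self.cells[index] = v

    def read(self) -> int:
        # The reader reads all components. If some value appears at least
        # f + 1 times it is returned; otherwise any value is acceptable,
        # and this sketch returns the most frequent one.
        value, _count = Counter(self.cells).most_common(1)[0]
        return value
```

With at most f corrupted components and no concurrent write, at least f + 1 components still hold the last written value, so the majority read returns it.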

Since a single (reliable) safe bit is sufficient to implement a
regular bit [Lam86], it follows from Theorem 4 that
$\mathrm{CONS}_\infty(\mathit{binary.safe}, f, \mathit{binary.regular}) = 2f + 1$.
Moreover, given Theorem 4, one can construct any
(multi-reader/multi-writer, arbitrary value) atomic register
using constructions from safe bits [Pet83, Lam86, BP87, PB87,
Blo87, SAG87, LTV89, Tro89]. For example, a construction due to
Tromp [Tro89] produces a binary atomic register from 3 safe bits
(3 are necessary [Lam86]),

$\mathrm{CONS}_\infty(\mathit{binary.safe}, f, \mathit{binary.atomic})$
$\le \mathrm{CONS}_\infty(\mathit{binary.safe}, f, \mathit{binary.safe}) \cdot \mathrm{CONS}_\infty(\mathit{binary.safe}, 0, \mathit{binary.atomic})$
$= 6f + 3$.

Define a V-register to be a read/write register on (arbitrary)
value domain V. A (fault-intolerant) construction by Peterson
[Pet83] produces an atomic V-register from 3 safe V-registers and
4 atomic binary registers. Another construction due to Tromp
[Tro89] produces an atomic V-register from 4 safe V-registers and
8 safe binary registers. Both these constructions can be composed
with the construction in Theorem 4, to construct reliable atomic
V-registers from unreliable safe V-registers and unreliable safe
bits. The construction in Figure 1 constructs a reliable atomic
V-register directly from unreliable components, specifically,
8f + 2 atomic V-registers and 2 reliable atomic bits. (The proof
of this construction is omitted from the extended abstract.)
Using Theorem 4 and the constructions just discussed of atomic
bits from safe bits, we have:

Theorem 5 One reliable, strongly-wait-free, atomic V-register can
be constructed from the combinations of components below, where
in each case, up to f of the combined components may be ∞-faulty:

• 6f + 3 safe V-registers and 24f + 12 safe binary
registers ([Pet83] and Theorem 4).

• 8f + 4 safe V-registers and 16f + 8 safe binary
registers ([Tro89] and Theorem 4).

• 8f + 2 atomic V-registers and 12f + 6 safe binary
registers (Figure 1).

R, W: reliable atomic on {0, 1}
val[0..1, 1..4f+1]: 2-D array, atomic in Value, unreliable

Writer's protocol:
function write(v)              % v is a value
    ptr := ¬R                  % copy not being used by reader
    for i = 1 to 4f+1 do val[ptr, i] := v od
    W := ptr                   % tell future readers where to look
end_function

Reader's protocol:
function read: value
    sawW, prev: persistent across invocations
                               % initially false and 0, resp.
    newW := W                  % where writer last finished
    if R = newW and sawW       % no evidence of a write
    then return(prev)          % return last value read
    else R := newW             % read copy last written
        sawW := false          % remember state
        for i = 4f+1 down to 1 do
            tmp[i] := val[R, i]
        if a ^ b, is a subseq. of tmp[1..4f+1]
        then                   % must have overlapped a write(a)
            sawW := true       % remember state
            prev := a          % return a until state changes
            return(a)
        else                   % don't know value of a concurrent write
            return(the majority value in tmp[1..4f+1])
end_function

Figure  1:  A  reliable  atomic  register 


If we consider the construction of atomic registers only from
faulty atomic registers of the same type, the construction from
Figure 1 dominates (using V-registers to implement safe bits):

Corollary 6 $\mathrm{CONS}_\infty(\mathit{atomic}, f, \mathit{atomic}) \le 20f + 8$.
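
The component counts in Theorem 5 and Corollary 6 follow from multiplying the fault-intolerant constructions by the 2f + 1 replication of Theorem 4, with each atomic bit built from 3 safe bits. The short check below spells out that arithmetic; the function names and dictionary keys are hypothetical and chosen only for illustration.

```python
def via_peterson(f: int) -> dict:
    """3 safe V-registers and 4 atomic bits, each bit from 3 safe bits,
    every safe component replicated 2f + 1 times."""
    rep = 2 * f + 1
    return {"safe_V": 3 * rep, "safe_bits": 4 * 3 * rep}


def via_tromp(f: int) -> dict:
    """4 safe V-registers and 8 safe bits, each replicated 2f + 1 times."""
    rep = 2 * f + 1
    return {"safe_V": 4 * rep, "safe_bits": 8 * rep}


def via_figure1(f: int) -> dict:
    """Two copies of 4f + 1 atomic V-registers, plus 2 reliable atomic bits,
    each built from 3 safe bits replicated 2f + 1 times."""
    return {"atomic_V": 2 * (4 * f + 1), "safe_bits": 2 * 3 * (2 * f + 1)}


def corollary6_total(f: int) -> int:
    """Total when every safe bit is itself stored in an atomic register."""
    counts = via_figure1(f)
    return counts["atomic_V"] + counts["safe_bits"]
```

For f = 1 this gives 9 safe V-registers and 36 safe bits via [Pet83], 12 and 24 via [Tro89], 10 atomic V-registers and 18 safe bits via Figure 1, and 28 atomic registers in total for Corollary 6.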

4  Test-and-set  registers 

Unfortunately,  atomic  registers  do  not  provide  a  very 
strong  memory  primitive.  Even  the  simple  task  of  two- 
processor  consensus  is  impossible  with  just  atomic  reg¬ 
isters.  Such  tasks  require  a  stronger  primitive  such  as 
test-and-set. 

A two-processor, binary, test-and-set register is a
concurrent object accessible by two processors through the
operations test&set and reset. The sequential
specification of the object is most simply understood as
operations on a binary register, initialized to 0. The test&set
operation atomically reads the register, writes 1 into it,
and returns the value read. The reset operation writes
0. If the object is faulty, the failure operations write(0)
and write(1) have the obvious effect.
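
The sequential specification above can be illustrated with a small Python sketch. The class name `TestAndSetRegister` and the method `fault_write` (standing in for the failure operations write(0) and write(1)) are illustrative names that do not appear in the paper, and the sketch models only sequential behavior, not concurrent access.

```python
class TestAndSetRegister:
    """Sequential model of a binary test-and-set register (hypothetical names)."""

    def __init__(self):
        self.value = 0  # the register is initialized to 0

    def test_and_set(self):
        """Read the register, write 1 into it, and return the value read."""
        old = self.value
        self.value = 1
        return old

    def reset(self):
        """Write 0; a well-formed caller invokes this only after reading 0."""
        self.value = 0

    def fault_write(self, bit):
        """Failure operation write(0) or write(1): a spontaneous change."""
        if bit not in (0, 1):
            raise ValueError("a binary register holds 0 or 1")
        self.value = bit


reg = TestAndSetRegister()
first = reg.test_and_set()        # the first caller reads 0 (wins)
second = reg.test_and_set()       # a later caller reads 1 (loses)
reg.reset()                       # the winner releases the register
after_reset = reg.test_and_set()  # the register can be won again
```

A fault such as `reg.fault_write(0)` makes the register look free even though a processor still holds it, which is the kind of behavior the constructions in this section have to mask.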

The processors are constrained in their use of the reset
operation: a processor should only invoke the reset
operation if its previous operation on the object was a
test&set that returned 0. If the processors violate this
well-formedness condition then the object may exhibit
arbitrary behavior.

A single-use test-and-set has no reset operation. (Except
in the statements of the theorems, we will say
“test-and-set” instead of “two-processor, binary
test-and-set”.)

In this section we show that reliable test-and-set
registers can be constructed from unreliable test-and-set
registers. We will allow the test-and-set constructions
to utilize reliable atomic read/write registers. (These
weaker but reliable registers can, of course, be
constructed from a set containing unreliable components
using the techniques of the last section.) We first show
the constructions for single-use test-and-set and then
extend to multi-use.

Theorem 7

• There is a strongly-wait-free, two-processor, single-use,
binary, test-and-set algorithm using 7 binary
test-and-set registers, 1 of which may be ∞-faulty
(Figures 2 and 3).

• There is a strongly-wait-free, two-processor, binary,
test-and-set algorithm using 14 binary test-and-set
registers, 1 of which may be ∞-faulty, and 4 reliable,
binary, atomic read/write registers (Figure 4).

Corollary  8  There  is  a  strongly-wait-free, 
two-processor,  binary,  test-and-set  algorithm  using 
14/108 test-and-sets,  f  of  which  may  be  oo-favlty,  and 
4(2/)'°*  =  56/'°*  reliable,  binary,  atomic  registers. 


Protocol for processor p:          % Other proc. is q.

shared A[1..3], B, C[1..3]: T&S objects, initially 0

function s-test&set                % return win (0) or lose (1)
    sum := 0
    for i = 1 to 3 do
        sum := sum + test&set(A[i])
    if sum < 2 then goto C         % p won A
    else if test&set(B) = 0 then return(1)
    else
C:      sum := 0
        for i = 1 to 3 do
            sum := sum + test&set(C[i])
        if sum < 2 then return(0)  % p won C
        else return(1)             % p lost C
end-function

Figure  2:  Single-use  test-and-set. 

Uses  7  test-and-set’s,  1  of  which  may  be  faulty. 
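
The control flow of Figure 2 can be exercised with a short Python sketch. This is only an illustration under stated assumptions: the names `TASBit`, `s_test_and_set` and `run_sequentially` are hypothetical, the two processors run one after the other rather than concurrently, and the single fault is modeled as one spurious write of 1 before either processor starts. It is not a proof of the construction.

```python
class TASBit:
    """A test-and-set bit; test_and_set returns the old value and sets 1."""

    def __init__(self):
        self.value = 0

    def test_and_set(self):
        old, self.value = self.value, 1
        return old


def s_test_and_set(A, B, C):
    """One processor's call, following Figure 2: 0 means win, 1 means lose."""
    lost_a = sum(reg.test_and_set() for reg in A) >= 2
    if lost_a and B.test_and_set() == 0:
        return 1  # lost A and was first at B: lose
    # Reached by the winner of A, or by a loser of A that found B already set.
    return 0 if sum(reg.test_and_set() for reg in C) < 2 else 1


def run_sequentially(faulty=None):
    """Run p, then q, on fresh registers; optionally corrupt one register."""
    A = [TASBit() for _ in range(3)]
    B = TASBit()
    C = [TASBit() for _ in range(3)]
    registers = A + [B] + C  # indices 0-2: A, 3: B, 4-6: C
    if faulty is not None:
        registers[faulty].value = 1  # assumed fault: spurious write(1)
    p_result = s_test_and_set(A, B, C)
    q_result = s_test_and_set(A, B, C)
    return p_result, q_result


outcomes = [run_sequentially(f) for f in [None, 0, 1, 2, 3, 4, 5, 6]]
```

In this sequential schedule the first processor wins and the second loses no matter which single register was corrupted, because a majority of the three copies in A and in C is still intact.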

Since two-processor consensus can be implemented
with three single-use test-and-set registers [LA87], we
have:

$\mathrm{CONS}_\infty(\text{single-use T\&S}, 1, \text{two-consensus}) \le 21$.
Figure 3: Schematic of single-use
test-and-set (from Figure 2). The schematic's outcome
labels are win, lose, lose.


Moreover, mutual exclusion is possible by using multi-use
test-and-sets as critical section locks. However,
binary test-and-set is too weak to implement consensus
for more than 2 processors [LA87].

Theorem 7 follows from the constructions in
Figures 2, 3 and 4. (Details are omitted from this extended
abstract.)

5  RMW  registers  and  consensus 

To complete our investigation of memory fault-tolerance
we turn to an object from the top of Herlihy’s
hierarchy of shared memory objects, the read-modify-write
register. Herlihy showed that reliable read-modify-write
(RMW) is a universal shared-memory primitive [Her91].
Any other shared object can be simulated using RMW.
This follows because RMW registers can be used to solve
n-processor consensus, which is itself universal.

Briefly, RMW registers enable a processor to atomically
read a register and, based on the value read, to
write a new value. In the consensus problem there
are n processors, each with an input value,
$\mathit{input}_p \in \{-1, +1\}$. A processor decides on a value output if it
writes output to its write-once output register. The
requirements of the consensus problem are that there exist
a decision value v such that each non-faulty processor
eventually decides on v and that v is the input value of
at least one processor.
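
As a concrete reading of these requirements, the following Python sketch checks agreement and validity for a finished run. The function name `satisfies_consensus` and the dictionary representation are illustrative assumptions; termination (every non-faulty processor eventually decides) is a property of the run and is not checked here.

```python
def satisfies_consensus(inputs, decisions):
    """Check agreement and validity for one finished run.

    inputs:    dict mapping each processor to its input, -1 or +1
    decisions: dict mapping each non-faulty processor to its decided value
    """
    decided = set(decisions.values())
    agreement = len(decided) <= 1             # a single decision value v
    validity = decided <= set(inputs.values())  # v is some processor's input
    return agreement and validity


ok = satisfies_consensus({"p": +1, "q": -1}, {"p": +1, "q": +1})
split = satisfies_consensus({"p": +1, "q": -1}, {"p": +1, "q": -1})
invented = satisfies_consensus({"p": +1, "q": +1}, {"p": -1, "q": -1})
```

Here `ok` is a run that meets both conditions, `split` violates agreement, and `invented` violates validity because -1 is nobody's input.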

5.1  Bounded  failures  per  register 

For a bounded number of faults per register we are able
to fully characterize the number of RMW registers
required for consensus:


Protocol for processor p:          % Other proc. is q.

shared TS[0..1]: single-use T&S objects
                 with s-test&set operation
       % written only by winner, initially 0
       lose, current: reliable atomic on {0, 1}
       % written only by owner, initially 0
       my_current[p..q]: reliable atomic on {0, 1}

function test&set
    t := current
    my_current[p] := t
    if lose then return(1)
    else return(s-test&set_t)
end-function

function reset
    lose := 1                      % Prevent q from winning
    other := my_current[q]         % where is q?
    reset all of TS[¬other]        % clean other copy
    current := ¬other              % Use clean copy in future
    lose := 0                      % Done reset, allow q freedom
end-function

Figure  4:  Multi-use  binary  test-and-set. 


Theorem 9  For any $m > 0$, there is a strongly-wait-free
consensus algorithm using $2m+1$ faulty read-modify-write
registers, provided the total number of memory
failures is at most m:

$\mathrm{CONS}(\mathrm{RMW}, m, \text{consensus}) \le 2m+1$.

Recall  that  a  register  is  k-faulty  if  it  can  change  its 
value  spontaneously,  without  any  processor  writing  into 
it,  at  most  k  times. 

Corollary 10  For any $1 \le k \le m$, there is a strongly-wait-free
consensus algorithm using $2m+1$ RMW registers
where at most $\lfloor m/k \rfloor$ registers are k-faulty:

$\mathrm{CONS}_k(\mathrm{RMW}, \lfloor m/k \rfloor, \text{consensus}) \le 2m+1$.

For $k = 1$, the above bound is tight:

Theorem 11  There is no n-processor consensus algorithm
using fewer than $2f+1$ RMW registers, at most
f of which may be 1-faulty, and which survives $\lceil n/2 \rceil$
processor failures.

Proof: Assume to the contrary that there is a solution
using $2f$ registers. Let the initial values of the registers
be $u_1, \ldots, u_{2f}$, respectively.

The processor-failure assumption requires that in any
run in which only half the processors, $p_1, \ldots, p_{\lceil n/2 \rceil}$, take
steps, they must eventually decide. Also, if their inputs
are identical, the validity condition requires that they
must decide on that value, as they don’t know whether
the other decision value is an input of some processor.


Thus, there is a failure-free, finite run x where half the
processors run alone with input +1 and decide on +1.
Let the final values of the registers $r_1, \ldots, r_{2f}$ in x be
$v_1, \ldots, v_{2f}$, respectively.

Now consider a run in which the other processors,
$p_{\lceil n/2 \rceil + 1}, \ldots, p_n$, take steps, run alone with input -1,
but find the registers hold values $u_1, \ldots, u_f, v_{f+1}, \ldots, v_{2f}$.
This is consistent with a run in which $p_1, \ldots, p_{\lceil n/2 \rceil}$
have taken no steps and f faults have changed the values
of $r_{f+1}, \ldots, r_{2f}$ to $v_{f+1}, \ldots, v_{2f}$, respectively. Hence,
$p_{\lceil n/2 \rceil + 1}, \ldots, p_n$ must all decide -1. But this run is also
consistent with a run in which $p_1, \ldots, p_{\lceil n/2 \rceil}$ have
already decided +1, and f faults have changed the values
of $r_1, \ldots, r_f$ back to $u_1, \ldots, u_f$, respectively. Hence,
$p_{\lceil n/2 \rceil + 1}, \ldots, p_n$ must all decide +1, a contradiction. ∎

Corollary  12 

• $\mathrm{CONS}_1(\mathrm{RMW}, f, \text{consensus}) = 2f+1$, and

• $\mathrm{CONS}(\mathrm{RMW}, m, \text{consensus}) = 2m+1$.

The proof of Theorem 9 is based on the algorithm
in Figure 5, which solves strongly-wait-free consensus
using $2m+1$ faulty read-modify-write registers,
provided the total number of memory failures is at most
m. The input to processor p is initially assigned to
register $\mathit{input}_p$, local to p, and an appropriate output
value must be assigned to register $\mathit{decide}_p$. Figure 5
uses the notation lock(r) and unlock to mark the
beginning and end of atomic, exclusive access to shared
read-modify-write register r. That is, it is assumed that
a processor can lock only one register at a time, that a
processor does not fail between pairs of lock(r) and
unlock statements, and that any non-faulty processor that
reaches a lock instruction eventually executes it.

The proof of correctness proceeds as follows. Note
first (from lines 5-10 in the protocol body) that each
processor’s protocol performs $O(m^2)$ wait-free
operations before deciding and terminating. Hence, the
algorithm is strongly-wait-free, and it suffices to consider
only complete runs, those in which every processor has
terminated. Thus, the algorithm is correct if its
complete runs satisfy the validity and agreement conditions.
The algorithm is analyzed under a stronger fault model
which allows m independent faults to occur to each of
the vote, plus and minus fields of the shared registers,
up to 3m faults in all. These faults are modeled as
assignments to the appropriate register fields. The
validity condition is proven in a straightforward manner
(Lemma 5.1). Next, we argue that the algorithm is
correct if and only if each of a constrained set of
executions is correct (Lemmas 5.2-5.4). These executions are
shown to satisfy an invariant that implies the agreement
condition (Lemmas 5.5-5.7).

Note  that  nowhere  in  any  processor’s  code  is  a  shared 
register  field  ever  set  to  0. 


Protocol for processor p, input_p ∈ {-1, +1}:

type reg = record minus, plus: in {0, 1}
                  vote: in {-1, 0, +1}

shared r: array[1..2m+1] of reg, initially (0,0,0)
local decide_p: in {-1, 0, +1}, initially 0

 1: for i = 1 to 2m+1 do
        % indicate that input_p is a valid input
 2:     if input_p = +1 then RMW(r[i], plus, 1)
 3:     else RMW(r[i], minus, 1) od
 4: d := input_p;
 5: for i = 1 to 2m+1 do
        % push sum away from 0
 6:     RMW(r[i], vote, d)
 7:     sum := 0
 8:     for j = 1 to i do
 9:         sum := sum + RMW(r[j], vote, d) od
10:     d := valid(sum) od
11: decide_p := d;                 % make final decision

 1: function RMW(reg, field, t): integer
        % atomically set reg.field from 0 to t.
 2:     lock(reg)
 3:     if reg.field = 0 then reg.field := t
 4:     tmp := reg.field
 5:     unlock
 6:     return(tmp)
 7: end-function

 1: function valid(v): integer
        % return sign(v) if valid, input_p otherwise
 2:     if v = 0 or sign(v) = sign(input_p)
 3:     then return(input_p)
 4:     else sm := 0
 5:         for j = 1 to 2m+1 do
 6:             if sign(v) = +1 then sm := sm + r[j].plus
 7:             else sm := sm + r[j].minus
 8:         od
 9:         if sm ≤ m
10:         then return(input_p)   % v is not valid
11:         else
12:             for j = 1 to 2m+1 do
13:                 if sign(v) = +1
14:                 then RMW(r[j], plus, 1)
15:                 else RMW(r[j], minus, 1)
16:             od
17:             return(sign(v))
18: end-function

 1: function sign(x): integer
 2:     if x > 0 then return(+1)
        else if x = 0 then return(0)
        else if x < 0 then return(-1)
 3: end-function

Figure  5:  Consensus  in  the  presence  of  m  memory 
failures. 
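
The protocol of Figure 5 can be traced with a small sequential simulation in Python. This is a sketch under several assumptions, not the authors' implementation: the class `Consensus` and its method names are hypothetical, each processor runs to completion before the next starts (so lock/unlock is implicit), a memory fault is injected by assigning to a field directly, and a value is treated as valid when more than m of its validity bits are set, which is the threshold suggested by Lemma 5.1.

```python
def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


class Consensus:
    """Shared state: 2m+1 registers with fields minus, plus and vote."""

    def __init__(self, m):
        self.m = m
        self.n = 2 * m + 1
        self.r = [{"minus": 0, "plus": 0, "vote": 0} for _ in range(self.n)]

    def rmw(self, i, field, t):
        """Set r[i].field to t only if it is 0; return the resulting value."""
        reg = self.r[i]
        if reg[field] == 0:
            reg[field] = t
        return reg[field]

    def mark_valid(self, value):
        """Set the validity bit for value (+1 or -1) in every register."""
        field = "plus" if value == +1 else "minus"
        for i in range(self.n):
            self.rmw(i, field, 1)

    def valid(self, v, inp):
        """Return sign(v) if enough validity bits back it, else the input."""
        if v == 0 or sign(v) == sign(inp):
            return inp
        field = "plus" if sign(v) == +1 else "minus"
        if sum(reg[field] for reg in self.r) <= self.m:
            return inp  # assumed threshold: v is not valid
        self.mark_valid(sign(v))
        return sign(v)

    def run(self, inp):
        """Execute the protocol body for one processor with input inp."""
        self.mark_valid(inp)
        d = inp
        for i in range(self.n):
            self.rmw(i, "vote", d)  # return value discarded
            total = sum(self.rmw(j, "vote", d) for j in range(i + 1))
            d = self.valid(total, inp)
        return d


shared = Consensus(m=1)
first = shared.run(+1)       # runs alone and decides its own input
shared.r[0]["vote"] = -1     # one injected memory fault (m = 1 is allowed)
second = shared.run(-1)      # still decides the value chosen by the first
```

In this particular schedule the second processor initially sees a sum that agrees with its own input, then a sum of 0, and finally a positive sum backed by three plus bits, so it switches to +1 and agreement holds. The sketch exercises one interleaving only and says nothing about concurrent schedules.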
Lemma 5.1 (Validity)

1. No processor p writes to a plus (respectively,
minus) field unless either $\mathit{input}_p = +1$ (respectively,
$\mathit{input}_p = -1$), or the processor has previously
observed 1 as the value in $m+1$ of the plus
(respectively, minus) fields.

2. No processor writes to a plus (respectively, minus)
field unless +1 (respectively, -1) is the input of
some processor.

3. No processor decides +1 (respectively, -1) unless
the processor has previously observed 1 as the value
in $m+1$ of the plus (respectively, minus) fields.

4. No processor writes +1 (respectively, -1) to a vote
field without first assigning 1 to the plus
(respectively, minus) fields of all $2m+1$ registers.

We  call  two  complete  runs  x  and  y  similar  if  each 
processor  has  the  same  input  value  in  x  as  in  y,  and 
decides  on  the  same  value  in  x  as  in  y. 

Next, note that the read-modify-write in line 9
modifies r[j].vote only if r[j].vote is first observed to be 0.
Since the same processor will have either set or observed
r[j].vote ≠ 0 in an earlier scan, this observation of 0
and the resulting modification is due to a memory fault on
r[j].vote.

Lemma 5.2  For any complete run y there is a similar
run x, such that x contains no more memory faults than
y, and such that in x no memory fault assigns 0 to any
vote field.

Proof: Let y be a complete run of the form
$y_1;\ r[j].\mathit{vote} := 0;\ y_2$, where r[j].vote := 0 is a memory
fault. There are several cases:

• No operation in $y_2$ references r[j].vote.

Then $y_1; y_2$ is a complete run that is similar to y,
has no more memory faults than y, and has one
fewer (faulty) assignment of 0 to a vote field.

• The first reference to r[j].vote in $y_2$ is a memory
fault. Then $y_2$ can be written as
$y_3;\ r[j].\mathit{vote} := v;\ y_4$, where $y_3$ contains no reference
to r[j].vote, r[j].vote := v is a memory fault and
$v \in \{-1, 0, 1\}$. Then $y_1;\ y_3;\ r[j].\mathit{vote} := v;\ y_4$ is a run
of the algorithm that is similar to y, contains no more
memory faults than y, and has one fewer (faulty)
assignment of 0 to a vote field.

• The first reference to r[j].vote in $y_2$ is a
read-modify-write.

That is, $y_2$ can be written
$y_3;\ r[j].\mathit{vote} = 0;\ r[j].\mathit{vote} := v;\ y_4$, where
$r[j].\mathit{vote} = 0;\ r[j].\mathit{vote} := v$ is the read-modify-write
by some processor p, and $y_3$ contains no explicit
reference to r[j].vote. Note that the read-modify-write
operations to the vote fields only change the value when
it is zero. Then
$y_1;\ r[j].\mathit{vote} := v;\ y_3;\ r[j].\mathit{vote} = v;\ y_4$ is a run of
the algorithm that is similar to y, contains no more
memory faults than y, and has one fewer (faulty)
assignment of 0 to a vote field.

In each case the number of faulty assignments of 0 to
a vote field decreases by one. The lemma follows by
induction. ∎

Lemma 5.3  Any complete run has a similar run, with
no more memory faults, in which no memory fault
occurs at r[j].vote when the value is 0.

Proof: Let y be a complete run of the form
$y_1;\ r[j].\mathit{vote} := v;\ y_2$, where r[j].vote := v is a memory
fault and the value of r[j].vote after $y_1$ is 0. Moreover,
let this be the first such memory fault in y. By the
previous lemma, it suffices to assume that no memory
fault assigns 0 to any vote field in y. Since no processor
ever writes 0 to any vote field, it follows that the value
of r[j].vote is 0 throughout $y_1$. There are several cases:

• Either $v = 0$ or no operation in $y_2$ references
r[j].vote.

Then $y_1; y_2$ is a complete run that is similar to y,
has one fewer memory fault than y, and has one
fewer (faulty) assignment to r[j].vote when the
value is 0.

• $v \ne 0$ and the first reference to r[j].vote in $y_2$
is a memory fault. Then $y_2$ can be written as
$y_3;\ r[j].\mathit{vote} := v';\ y_4$, where $y_3$ contains no
reference to r[j].vote and $v' \in \{-1, 0, 1\}$. Then
$y_1;\ y_3;\ r[j].\mathit{vote} := v';\ y_4$ is a run of the algorithm
that is similar to y, has one fewer memory fault
than y and the same number of (faulty)
assignments to r[j].vote when the value is 0.

• $v \ne 0$ and the first reference to r[j].vote in $y_2$ is a
read-modify-write.

That is, $y_2$ can be written $y_3;\ r[j].\mathit{vote} = v;\ y_4$,
where r[j].vote = v is the read-modify-write by
some processor p, and $y_3$ contains no explicit
reference to r[j].vote. Since the value of r[j].vote is
0 throughout $y_1$, this read-modify-write operation
is from line 6 in the code, and the value returned
is discarded by the executing processor. Hence,
$y_1;\ y_3;\ r[j].\mathit{vote} = 0;\ r[j].\mathit{vote} := v';\ r[j].\mathit{vote} := v;\ y_4$
is a run of the algorithm that is similar to
y, has the same number of memory faults as y, and
one fewer (faulty) assignment to r[j].vote when the
value is 0.

In each case the total number of memory faults is either
reduced, or the total number remains the same, with one
fewer (faulty) assignment to r[j].vote when the value
is 0. The lemma follows by induction. ∎
Lemma 5.4  Any complete run has a similar run, with
no more memory faults, in which every memory fault to a
vote field changes its value either from +1 to -1 or
vice versa.

Proof: By the previous two lemmas, it suffices to
consider complete runs in which no memory fault assigns 0
to a vote field, or over-writes a 0 in a vote field. The only
remaining alternatives are memory faults which write
+1 or -1, but do not change the value. These faults
can be trivially deleted, resulting in a run satisfying the
conditions of the lemma. ∎

Call runs satisfying the conditions of this lemma legal
runs. Call read-modify-write operations to vote fields
that actually change the value successful
read-modify-writes. Call the $i - 1$ reads by p immediately preceding
a successful read-modify-write to r[i].vote by p in line
6, the collect for that write.

Lemma 5.5  In any legal run x, there are exactly $2m+1$
successful read-modify-writes, one to each vote field.
Furthermore, the collects for any two such successful
read-modify-writes are not concurrent: if $i < j$, the
collect for the successful read-modify-write to r[i].vote
precedes the successful read-modify-write to r[i].vote, which
in turn precedes the collect for the successful write to
r[j].vote.

Proof: By definition, every legal run is complete and the
only memory faults to vote fields change the value from
+1 to -1, or -1 to +1. In complete runs, every
processor executes a read-modify-write on each vote field,
so each is changed from 0 to $\pm 1$ at least once. Once set,
being non-zero is stable, so each vote field has exactly
one successful read-modify-write.

The condition on collects holds trivially if both
successful read-modify-writes and their collects are by the
same processor. Suppose the read-modify-writes are by
different processors, p and q, to r[i].vote and r[j].vote,
respectively. Note that q does an unsuccessful
read-modify-write to r[i].vote before the collect for r[j].vote
begins. Hence, the successful read-modify-write to
r[i].vote by p precedes this. The condition follows. ∎

Let $s_k$ be the state of the system in a legal run after k
atomic operations, and let $AS_k$ be the number of remaining
unexecuted faults to vote fields in the run, i.e., m minus the
number of such faults so far. Let $CS_k$ be the number
of 0’s in the vote fields of the registers in $s_k$, define
$\Sigma_k$ to be the sum of the vote fields in $s_k$,
$\Sigma_k = \sum_i r[i].\mathit{vote}$, and finally, define
$A_k$ to be $|\Sigma_k| + CS_k - 2AS_k$.
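
To make these definitions concrete, the following Python sketch evaluates the quantity |Σ| + CS - 2·AS for a given vector of vote fields. The function name `invariant_quantity` and its argument names are illustrative assumptions, and CS is taken to count the vote fields that are still 0.

```python
def invariant_quantity(votes, remaining_faults):
    """Return |sum of votes| + (number of zero votes) - 2 * remaining faults."""
    sigma = sum(votes)                         # sum of the vote fields
    zeros = sum(1 for v in votes if v == 0)    # vote fields still equal to 0
    return abs(sigma) + zeros - 2 * remaining_faults


m = 2
# Empty run: all 2m+1 vote fields are 0 and all m faults remain.
initial = invariant_quantity([0] * (2 * m + 1), remaining_faults=m)
# A successful read-modify-write that moves the sum away from 0.
after_write = invariant_quantity([+1, 0, 0, 0, 0], remaining_faults=m)
# A fault that then flips that vote from +1 to -1 uses up one fault.
after_fault = invariant_quantity([-1, 0, 0, 0, 0], remaining_faults=m - 1)
```

With m = 2 the three values are 1, 1 and 3: the write leaves the quantity unchanged, and the sign-flipping fault raises it by 2.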

Lemma 5.6  For any $k \ge 0$, $A_k > 0$.

Proof: We first characterize the changes to these
parameters that can result from any single step of the
algorithm. That is, let $\pi$ be a step in a legal run that
changes the state from $s_k$ to $s_{k+1}$.

1. $\pi$ is a step of the adversary.

That is, $\pi$ is r[i].vote := v, where the value of
r[i].vote is $-v$ in $s_k$. Note that $CS_{k+1} = CS_k$
and $AS_{k+1} = AS_k - 1$. Here there are four key
sub-cases.

(a) $\Sigma_k = 0$. Then $|\Sigma_{k+1}| = 2$, and $A_{k+1} = A_k + 4$.

(b) $0 < |\Sigma_k| < |\Sigma_{k+1}|$. Then $|\Sigma_{k+1}| = |\Sigma_k| + 2$,
and again $A_{k+1} = A_k + 4$.

(c) $0 < |\Sigma_k| = |\Sigma_{k+1}|$. Then $|\Sigma_{k+1}| = |\Sigma_k| = 1$,
and $A_{k+1} = A_k + 2$.

(d) $|\Sigma_{k+1}| < |\Sigma_k|$. Then $|\Sigma_{k+1}| = |\Sigma_k| - 2$, and
$A_{k+1} = A_k$.

2. $\pi$ is a successful read-modify-write.

That is, $\pi$ is r[i].vote = 0; r[i].vote := v. Then
$CS_{k+1} = CS_k - 1$, $AS_{k+1} = AS_k$, and
$\Sigma_{k+1} = \Sigma_k + v$. There are two subcases:

(a) $|\Sigma_k| < |\Sigma_{k+1}|$. Then $|\Sigma_{k+1}| = |\Sigma_k| + 1$, and
$A_{k+1} = A_k$.

(b) $|\Sigma_{k+1}| < |\Sigma_k|$. Then $|\Sigma_{k+1}| = |\Sigma_k| - 1$, and
$A_{k+1} = A_k - 2$.

3. $\pi$ is any other atomic step.

Then $CS_{k+1} = CS_k$, $AS_{k+1} = AS_k$, and
$|\Sigma_{k+1}| = |\Sigma_k|$. Hence $A_k = A_{k+1}$.

In every (sub)case but one, 2b, the value $A_{k+1}$ is
greater than or equal to $A_k$. The problematic case
is the occurrence, then, of read-modify-write operations
that decrease the value of $|\Sigma|$. Intuitively, such
operations occur because faults have occurred so as to cause
a processor to inadvertently “move” the value of $|\Sigma|$
in the wrong direction. As the inductive proof below
shows, for each such read-modify-write operation that
decreases $A$ by 2, there must be an earlier matching
fault of type 1a, 1b or 1c that increases $A$ by at least
2, and the invariant follows.

The proof of the lemma proceeds by induction on the
prefixes of the run. Clearly the invariant holds for the
empty run ($AS_0 = m$; $CS_0 = 2m+1$; and $|\Sigma_0| = 0$).
Let $\alpha\pi$ be a prefix of the run, where $\pi$ is the $(k+1)$’st
atomic operation and the invariant holds for every state
$s_0, \ldots, s_k$. By the analysis above, no atomic step of the
algorithm can falsify the invariant unless case 2b applies.
In this case, $\pi$ is a successful read-modify-write by some
processor p, r[i].vote = 0; r[i].vote := v, and
$|\Sigma_{k+1}| = |\Sigma_k| - 1$. Note that $\Sigma_k \ne 0$.

Since $\Sigma_k \ne 0$ and this is a legal run, in which the
vote fields are first written sequentially by
read-modify-write operations, $i > 1$. Moreover, by Lemma 5.5,
$\alpha\pi = \alpha_1;\ r[i-1].\mathit{vote} = 0;\ r[i-1].\mathit{vote} := v';\ \alpha_2\pi$, where
$r[i-1].\mathit{vote} = 0;\ r[i-1].\mathit{vote} := v'$ is the successful
read-modify-write to r[i-1].vote and there are no
successful writes in $\alpha_2$. Let $s_j$ be the state at the
beginning of $\alpha_2$, just after the successful read-modify-write to
r[i-1].vote. By induction, $A_j > 0$. Also by Lemma 5.5,
$\alpha_2$ contains the $i - 1$ reads in the collect by p that
precedes $\pi$. In addition, since none of the operations in $\alpha_2$
are successful read-modify-writes, by the analysis above
$A_j \le \ldots \le A_k$.

Next, consider the sequence of values $\Sigma_j, \ldots, \Sigma_k$. We
examine cases depending on the sign of the sum
collected by p, and show that a fault described in case 1a,
1b or 1c must occur in $\alpha_2$, implying $A_k \ge 3$, and so
$A_{k+1} \ge 1$.

• The collect sums to a valid value. There are two
subcases.

- Some $\Sigma$ in $\Sigma_j, \ldots, \Sigma_k$ has the same sign as the
sum collected.

Since $|\Sigma_{k+1}| = |\Sigma_k| - 1$, the sign of the collect
and hence of $\Sigma$ is different than the sign of
$\Sigma_k$. This must be due to a fault of type 1a or
1c, above. That is, there exist $\Sigma_r$ and $\Sigma_{r+1}$
in this sequence such that either $\Sigma_r = 0$ and
$|\Sigma_{r+1}| = 2$, or $|\Sigma_r| = |\Sigma_{r+1}| = 1$ and
$\Sigma_r = -\Sigma_{r+1}$. Then either $A_{r+1} = A_r + 4 \ge 5$ or
$A_{r+1} = A_r + 2 \ge 3$, respectively, so $A_k \ge 3$.
Hence, $A_{k+1} \ge 1$.

- No $\Sigma$ in $\Sigma_j, \ldots, \Sigma_k$ has the same sign as the sum
collected.

Then a majority of the $i - 1$ registers each
have the same sign as the collect and as v at
some point in the interval, but not at the end
of the interval. Then a fault in the interval
must change the value of one of these, from
$v$ to $-v$. That is, there exist $\Sigma_r$ and $\Sigma_{r+1}$ in
this sequence such that $|\Sigma_{r+1}| = |\Sigma_r| + 2$, and
$A_{r+1} = A_r + 4 \ge 5$. Hence, $A_{k+1} \ge 3$.

• The collect sums to 0. There are two subcases.

- Some $\Sigma$ in $\Sigma_j, \ldots, \Sigma_k$ has value 0.

Since $\Sigma_k \ne 0$, some fault must move the sum
from 0. That is, there exist $\Sigma_r$ and $\Sigma_{r+1}$ in
this sequence such that $\Sigma_r = 0$ and $|\Sigma_{r+1}| = 2$.
Hence, $A_{r+1} = A_r + 4 \ge 5$ and $A_{k+1} \ge 3$.

- No $\Sigma$ in $\Sigma_j, \ldots, \Sigma_k$ has value 0.

That is, half the registers are read as positive,
and half as negative. Suppose first that there
exist $\Sigma_r$ and $\Sigma_{r+1}$ in the sequence $\Sigma_j, \ldots, \Sigma_k$
that have different sign: that $|\Sigma_r| = |\Sigma_{r+1}| = 1$
and $\Sigma_r = -\Sigma_{r+1}$. Then $A_{r+1} = A_r + 2 \ge 3$
and $A_k \ge 3$. Hence, $A_{k+1} \ge 1$.

Suppose next that all of $\Sigma_j, \ldots, \Sigma_k$ have the
same sign. Since $|\Sigma_{k+1}| = |\Sigma_k + v| < |\Sigma_k|$,
they have different sign than v. The collect
read half the registers with vote = $-v$
and half with vote = $v$, yet the $\Sigma$ all have
sign different than v, so some fault changes a
value from $v$ to $-v$ in $\alpha_2$. That is, there
exist $\Sigma_r$ and $\Sigma_{r+1}$ in the sequence such that
$|\Sigma_{r+1}| = |\Sigma_r| + 2$, and $A_{r+1} = A_r + 4 \ge 5$.
Hence, $A_{k+1} \ge 3$.

• The collect is nonzero and invalid.

By Lemma 5.1, all $i - 1$ of the earlier successful
read-modify-writes wrote the only valid value, v, yet the
summed collect had opposite sign. Hence, in $\alpha$ at
least $\lceil i/2 \rceil$ registers had faults changing the value
from $v$ to $-v$. Recall that $AS_{k+1}$ is m minus the
number of faults in $\alpha\pi$; hence, $AS_{k+1} \le m - \lceil i/2 \rceil$,
and $2AS_{k+1} \le 2m - i$. Hence, we have

$$\begin{aligned}
A_{k+1} &= |\Sigma_{k+1}| + CS_{k+1} - 2AS_{k+1} \\
        &= |\Sigma_{k+1}| + (2m + 1 - i) - 2AS_{k+1} \\
        &\ge |\Sigma_{k+1}| + 2m + 1 - i - (2m - i) \\
        &\ge |\Sigma_{k+1}| + 1 \\
        &\ge 1. \qquad ∎
\end{aligned}$$

Lemma 5.7 (Agreement)  All processors decide on
the same value.

Proof: Consider $\Sigma_k$ in the system state $s_k$, immediately
after the last register, r[2m+1], has been written.
By Lemma 5.6, $2AS_k < |\Sigma_k|$. Henceforth, there are
insufficient faults remaining to change the sign of $\Sigma_k$,
or reduce it to 0. Since all the reads in any final collect
(upon which any decision is based) are made after $s_k$,
all processors decide on the same value. ∎

5.2  Unbounded  failures  per  register 

The protocol of Figure 5 does not work when the
number of faults per register is unbounded. For the case of
∞-faulty registers we use a slightly different technique.

Consider an array of $2t+1$ read-modify-write
registers over the range $\{-1, 0, +1\}$, initialized to 0. Each
processor scans the registers, one at a time, writing an
input value $d \in \{-1, +1\}$ to each register whose value
is zero and returning the resulting non-zero value of the
register. After scanning all $2t+1$ registers, the
processor outputs the majority of the values returned from
the registers. Call this construct filter(t). Three
observations are important:

1.  If  at  most  t  registers  are  faulty  then  each  processor 
sees  a  majority,  which  is  the  input  value  for  some 
processor. 

2.  If  at  most  t  registers  are  faulty  and  all  processors 
have  the  same  input  value  then  all  processors  see 
the  same  value  in  majority. 

3.  If  there  are  no  memory  faults  then  all  processors 
compute  the  same  majority  value,  which  is  the  in¬ 
put  of  some  processor. 
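
The filter construct and the second and third observations can be tried out with a short Python sketch. It is only an illustration: the function name `run_filter` is hypothetical, processors run one after another, and a faulty register is modeled by overwriting its value between two scans.

```python
def run_filter(registers, d):
    """One processor's scan of filter(t) over a shared list of 2t+1 values.

    Each register is set to d only if it currently holds 0, and the
    resulting non-zero value is recorded; the majority of the recorded
    values is returned.
    """
    seen = []
    for idx in range(len(registers)):
        if registers[idx] == 0:      # read-modify-write: write only over 0
            registers[idx] = d
        seen.append(registers[idx])
    # An odd number of +1/-1 values never sums to 0.
    return 1 if sum(seen) > 0 else -1


# No faults (third observation): both processors compute the same majority.
clean = [0, 0, 0]                    # t = 1, so 2t+1 = 3 registers
a = run_filter(clean, +1)
b = run_filter(clean, -1)

# One faulty register, identical inputs (second observation).
noisy = [0, 0, 0]
c = run_filter(noisy, +1)
noisy[0] = -1                        # assumed fault between the two scans
e = run_filter(noisy, +1)
```

In the fault-free case the second processor adopts the first one's value, and with a single corrupted register out of three the two honest copies still decide the majority.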
Protocol for processor p, v_i ∈ {-1, +1}:

type reg = record minus, plus: in {0, 1}
                  val: in {-1, 0, +1}

shared r: array[1..(f+1)²] of reg, initially (0,0,0)
local v_o: in {-1, 0, +1}, initially 0

% indicate that v_i is a valid input
for j = 1 to 2f+1 do
    if v_i = +1 then RMW(j, plus, 1)
    else RMW(j, minus, 1)

v_o := decide(f, v_i)              % make final decision

function decide(i, input)
    if i = 0 then return(RMW(1, val, input))
    else input := valid(decide(i - 1, input))
        sum := 0
        for j = 1 to 2i+1 do
            sum := sum + RMW(i² + j, val, input)
        return(sign(sum))
end-function

functions valid and RMW are as in Figure 5

Figure 6: Consensus in the presence of f ∞-faulty
RMW registers: $(f+1)^2$ constant-size registers.


5.2.1  RMW  registers  of  constant  size 

Simply having each processor use $f+1$ copies of filter(f)
(or $(f+1)(2f+1)$ 3-valued RMW registers), with $v_i$ as the
input to the first copy and the output of the previous
filter as the input to the next, results in a consensus on the
final output. However, the protocol in Figure 6 employs
a recursive construction and the validation technique
from the algorithm in Figure 5 to achieve consensus,
while using about half as many registers (each register
is slightly larger, requiring two boolean fields and one
three-valued field).

Theorem 13  For any $f > 0$, there is a strongly-wait-free
consensus algorithm using $(f+1)^2$ 12-valued
read-modify-write registers, at most f of which are ∞-faulty:

$\mathrm{CONS}_\infty(\text{constant-rmw}, f, \text{consensus}) \le (f+1)^2$.

Sketch of proof: The algorithm is recursive; assume a
strongly-wait-free consensus algorithm for $i - 1$ ∞-faulty
registers, called decide(i - 1, input). With filter(i),
strongly-wait-free consensus with i ∞-faulty registers
can be solved recursively as follows: first, as in the
algorithm of Figure 5, each processor p validates its input
value by setting bits in $2f+1$ registers. Then it runs
decide(i - 1, input_p), and obtains a decision value, d. If
p finds that d is valid (bits in at least $f+1$ corresponding
validation fields are set), p enters the filter with d
as input, otherwise p enters the filter with input_p. The
output of the filter is then the final decision value.


1. If fewer than i faulty registers occur in decide(i-1),
then all processors leave decide(i-1, input_p)
with the same valid value, d, and by the second
observation, all leave filter(i) with the value d.

2. If, however, there are i faulty registers in
decide(i-1, input_p), then processors leave
decide(i-1, input_p) with arbitrary values, but the validity
check ensures they enter filter(i) with a value which
is the input of some processor. Since in this case
there are no faults in filter(i), by the third
observation, all processors leave filter(i) with the same
value d, and d is the input of some processor.

In either case, the validity and agreement conditions
are satisfied by decide(i, input_p). Since both
decide(i-1, input) and filter(i) are strongly-wait-free,
so is decide(i, input_p).

Figure 6 presents a strongly-wait-free consensus
algorithm implementing this recursion. As in the algorithm
of Figure 5, only one set of validity bits is needed, as any
processor that reads f+1 bits for value d, in any level
of the recursion, sets all 2f+1 bits before proceeding.
This has the consequence that no value is written by a
processor unless it has first set all 2f+1 validity bits
for that value. Hence, at any level i of the recursion, a
processor can compute an invalid majority only if i+1
faults have occurred. The full proof of this algorithm
is left to the full paper. The total number of registers
used is $\sum_{i=0}^{f} (2i+1) = (f+1)^2$. ∎
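The register counts quoted in this subsection can be checked with a few lines of Python. This is only an arithmetic sketch; the function names are illustrative and do not appear in the paper. It assumes, as the text states, that filter(i) uses 2i+1 registers.

```python
def registers_chained(f):
    """f+1 chained copies of filter(f), each using 2f+1 registers."""
    return (f + 1) * (2 * f + 1)


def registers_recursive(f):
    """Recursive construction: filter(i) with 2i+1 registers for i = 0..f."""
    return sum(2 * i + 1 for i in range(f + 1))


if __name__ == "__main__":
    for f in range(1, 6):
        print(f, registers_chained(f), registers_recursive(f))
```

For large f the ratio of the two counts approaches 2, which matches the "about half as many registers" remark above.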

5.2.2  RMW  registers  of  exponential  size 

Another simple consensus algorithm can be designed using
filter(f) and 3f+1 registers, each of exponential size.

Theorem 14 For any f > 0, there is a strongly-wait-free
consensus algorithm using 3f+1 read-modify-write
registers of exponential size, at most f of which are
∞-faulty:

CONS_∞(exponential_rmw, f, consensus) ≤ 3f+1.

Proof: The algorithm iteratively runs filter(f) over (a
specific enumeration of) every subset of the registers
of size 2f+1, using a different bit-field in each register
in each run of filter(f), and using the output from the
previous filter as input to the next. Figure 7 presents
the details of the algorithm.

The algorithm requires exponential time for each
processor, but is still strongly-wait-free. Since f bounds the
total number of register faults, the majority computed
by each processor is always the input of some processor,
and the validity condition is satisfied. Moreover, since
at least one instance of filter uses no faulty registers,
the third observation above implies that at some point
all processors exit filter with the same value. By the
second observation, all processors exit each later filter
with that value. The agreement condition follows. ∎
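The counting argument in this proof can be illustrated with a short Python sketch. It only enumerates the subsets of size 2f+1 and checks which of them avoid a given set of faulty registers; it does not implement filter itself. The function names and the 1-based register numbering are assumptions made for the illustration.

```python
from itertools import combinations
from math import comb


def filter_subsets(f):
    """All subsets of registers {1, ..., 3f+1} of size 2f+1."""
    return list(combinations(range(1, 3 * f + 2), 2 * f + 1))


def clean_subsets(f, faulty):
    """Subsets that contain no faulty register."""
    bad = set(faulty)
    return [s for s in filter_subsets(f) if not bad & set(s)]


if __name__ == "__main__":
    f = 2
    print(len(filter_subsets(f)), comb(3 * f + 1, 2 * f + 1))
    print(clean_subsets(f, faulty=[1, 4]))
```

With exactly f faulty registers there are 2f+1 fault-free ones, so exactly one subset is fault-free. The number of subsets, and hence the number of fields per register, is the binomial coefficient C(3f+1, 2f+1), which grows exponentially in f.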
6  Discussion


Protocol for processor p, v_i ∈ {-1, +1}:

type   𝒮 = subsets of {1, ..., 3f+1} of size 2f+1
       val = {-1, 0, +1}
       register = rep[S]: array S ∈ 𝒮 of val
           % Each reg. is an array of val fields.
shared r[j]: array 1 ≤ j ≤ 3f+1 of register
           % initially all val fields 0
local  v_f: ranges over {-1, 0, +1}, initially 0

input := v_i
for each subset S ∈ 𝒮, in enumeration order do
    sum := 0
    for each j ∈ S do
        sum := sum + RMW(r[j], [S], input)
    input := sign(sum)
v_f := input

Figure 7: Consensus in the presence of f ∞-faulty
RMW registers; 3f+1 exponential-sized registers.


5.3  Faulty  RMW  registers  are  universal 

Herlihy  has  defined  the  notion  of  a  universal  object, 
as  an  object  that  can  be  used  to  construct  a  wait-free 
implementation  of  any  other  object  [Her91].  He  then 
showed  that  consensus  for  n  processes  is  universal  for 
systems  with  at  most  n  processes. 

Theorem 15 Read-modify-write registers are universal
objects for systems with at most n processes, if a
bounded number of them are ∞-faulty: For all objects
X, CONS_∞(unbounded_size_rmw, f, X) < ∞.

Proof:  The  proof  relies  on  Herlihy’s  construction  in  his 
proof  of  the  universality  of  consensus,  which  uses  0(n®) 
reliable  atomic  read/ write  registers  of  unbounded  size 
and  0(ti®)  reliable  consensus  objects  over  a  bounded 
domain.  The  data-oblivious  constructions  of  Theorem  5 
allow  us  to  construct  the  reliable  unbounded  atomic 
read/write  registers  from  oo-faulty  read-modify-write 
registers. 

Next, by Theorem 13, we know that sufficiently many
read-modify-write registers can be used to implement
n-processor binary consensus, when any fixed number
of them may be ∞-faulty. Known constructions can
be used to implement multi-valued consensus from
binary [TC84]. Moreover, the read-modify-write objects
used in the binary construction, and so in the
multi-valued constructions, may be easily reset as Herlihy's
universal construction requires. (Specifically, the reset
operations are not concurrent with other consensus
operations.) ∎


We  have  studied  memory  failures  that  are  restricted  in 
total  number,  or  in  the  number  of  data  objects  that  may 
be  affected.  Memory  failures  may  be  restricted  in  time, 
as  well.  For  example,  such  constrained  memory  faults 
are  studied  in  work  on  self-stabilizing  systems  defined  by 
Dijkstra [Dij74]. Self-stabilizing systems are required to
recover  once  the  final  memory  fault  occurs,  and  the  sys¬ 
tem  is  in  an  arbitrary  state.  Hence,  such  failures  may 
affect  local  memories  of  processors,  as  well  as  the  shared 
memory.  However,  work  in  self-stabilization  (neces¬ 
sarily)  considers  only  non-terminating  control  problems 
such  as  the  mutual  exclusion  problem,  whereas  we  also 
study  short-lived  objects  such  as  the  consensus  problem, 
in  which  a  processor  makes  a  single,  irrevocable  decision 
after  a  finite  number  of  steps. 

Three  previous  papers  investigated  initialization  fail¬ 
ures  that  are  restricted  to  the  shared  memory,  and  in¬ 
spired  our  work  [FMRT90,  FMT91,  MTY92].  A  shared 
register  is  subject  to  initialization  failure  if  the  shared 
register  contains  an  arbitrary  unknown  value  when  the 
algorithm  begins.  These  three  papers  assume  that  all 
the  shared  registers  are  subject  to  initialization  failures, 
and  study  both  control  and  decision  problems. 

In  this  paper  we  used  the  consensus  problem  to  ex¬ 
plore  properties  of  faulty  shared  memory.  Much  is 
known  about  the  consensus  problem  in  other  mod¬ 
els.  See,  e.g.  [Abr88,  Fis83,  FLM86,  FLP85,  LA87]. 
We  also  investigated  the  question  of  constructing  re¬ 
liable  registers  in  an  unreliable  environment.  This 
relates  to  the  problem  of  implementing  one  type  of 
shared  objects  from  another.  Such  work  includes: 
[Lam86,  VA86,  Blo87,  BP87,  SAG87,  LTV89,  Her91]. 

6.1  Open  problems 

There  remain  many  unresolved  issues  related  to  shared 
memory  failures  in  distributed  systems.  Faulty  versions 
of  other  shared  data  objects,  such  as  multi-valued  test- 
and-set  registers,  m-registers,  or  compare-and-swap,  are 
of  interest.  We  have  tight  bounds  on  only  a  few  prob¬ 
lems;  more  efficient  constructions  and  corresponding 
lower bounds would also be interesting. For example,
our implementations of consensus from read-modify-write
objects suggest a trade-off between the number
of read-modify-write objects and their size; we
conjecture that 2f+1 registers are sufficient, but whether they
must be large is not clear.

It  would  be  particularly  interesting  to  implement 
memory-fault  tolerant  data  objects  directly  from  sim¬ 
ilar,  faulty  objects,  such  as  test-and-set  from  test-and- 
set,  without  using  atomic  registers,  or  read-modify- 
write  from  read-modify-write,  without  using  an  un¬ 
bounded  universal  construction. 

Theorem 3 describes a composition in which faulty
objects are first used to construct fault-free objects,
which  can  then  be  used  to  construct  other  fault-free 
objects.  Our  fault  definition  is  not  general  enough  to 
support  the  dual  result:  using  fault-intolerant  construc¬ 
tions  from  faulty  objects,  then  using  the  resulting  faulty 
objects  in  a  fault-tolerant  construction.  The  problem  is 
to  characterize  the  result  of  applying  a  fault-intolerant 
construction  of  a  type  V  object  from  faulty  objects  of 
type  X.  We  are  currently  considering  general  defini¬ 
tions  that  would  support  such  a  result,  and  more  com¬ 
plex  compositions  of  two  or  more  fault-tolerant  compo¬ 
sitions. 

All  our  solutions  are  deterministic.  It  would  be  inter¬ 
esting  to  explore  the  use  of  randomization  to  tolerate 
memory  failures.  Also,  there  is  much  work  to  be  done 
in  exploring  the  effect  of  memory  failures  in  other  mod¬ 
els,  such  as  synchronous  or  semi-synchronous  models. 
Abstract

Our purpose is to implement clocks and, in general,
counters in a shared memory environment. A
concurrent counter is a counter that can be
incremented and read, possibly at the same time, by
many processes. We study counters that achieve a
high level of concurrency and thus are likely to
reduce memory contention; require only weak
atomicity and thus are easy to implement; do not
depend on the initial state of the memory and hence
are more robust to memory changes; and are
wait-free - one process cannot prevent another process
from finishing its increment or read operations -
and thus can tolerate any number of process
failures. We concentrate on providing upper and lower
bounds on the space complexity of the counters
studied.

1  Introduction 

1.1  The  Concurrent  Counter  Problem 

Counters are basic objects which are used in
various computer applications. A counter (mod m)
holds an integer from {0, ..., m-1}, and enables two
basic operations: increment - which increments
the value by one (mod m), and look - which
gets the current value. A concurrent counter is a
counter in a shared memory environment, which
can be incremented and looked at, possibly at the
same time, by many processes. Throughout the
paper we assume that a counter is always
incremented modulo m for some fixed m.

Since counters, or objects which include counters
as a special case, are involved in many protocols,
results concerning concurrent counters are
likely to have implications for other problems in
asynchronous computation. The implementation
of concurrent counters raises many basic problems
concerning the possibility and the cost of
multi-process coordination in asynchronous shared
memory systems. In this paper we study two types
of concurrent counters.

A  static  counter  guarantees  that  the  counter  will 
hold  the  correct  value  even  if  it  is  incremented  and 
read  concurrently  by  several  processes.  However,  it 
only  guarantees  that  a  look  operation  returns  the 
correct  value  when  it  is  not  concurrent  with  any 
increment  operation.  In  the  case  that  a  look  op¬ 
eration  overlaps  an  increment  operation,  a  static 
counter  may  return  an  arbitrary  value. 

A dynamic counter guarantees that the counter
will hold the correct value even if it is incremented
concurrently by several processes and that
processes can read a correct value of the counter even
if the read is concurrent with other increments or
reads. That is, for a given look operation, let c_1
be the initial value of the counter plus the number
of increment operations that were completed
before the look operation started, and let c_2 be the
initial value of the counter plus the total number
of increment operations that were initiated before
the look operation was completed. Then the look
operation should return some value between c_1 and
c_2.
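The condition above can be written as a small predicate. The sketch below is one reading of the definition, not code from the paper: it assumes that "between c_1 and c_2" is taken modulo m, so a returned value is acceptable if it equals v mod m for some integer v with c_1 ≤ v ≤ c_2. The function name is hypothetical.

```python
def is_valid_dynamic_look(returned, c1, c2, m):
    """True if `returned` (in 0..m-1) equals v mod m for some c1 <= v <= c2.

    c1: initial value plus increments completed before the look started.
    c2: initial value plus increments initiated before the look completed.
    """
    if c2 - c1 >= m - 1:
        return True  # the window [c1, c2] already covers every residue
    return (returned - c1) % m <= c2 - c1


if __name__ == "__main__":
    print(is_valid_dynamic_look(4, 3, 5, 8))  # inside the window
    print(is_valid_dynamic_look(2, 3, 5, 8))  # outside the window
```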

The most common example of a concurrent
counter is probably a global clock, which can be
incremented by one process, but arbitrarily many
processes may get its value [Lam90]. In
implementing a global clock one would not like to set
a bound on the number of processes that may see
the time in this clock, though in general only one
process is allowed to change its value.

Another example of a possible use of a
concurrent counter is in protocols for the wakeup
problem [FMRT90]. In some of the protocols for this
problem every process starts its participation in
the protocol by incrementing a counter. The
counters used in the wakeup protocols differ from the
one used for global clocks in two important
features: first, they can be incremented by many
distinct processes; second, in each run of the protocol,
the counter of the wakeup protocols can be
incremented at most some bounded and known number
of times. One more example is in fault-tolerant
solutions for the consensus or leader election
problems [Fis83, KMZ84, Pet82]. In such solutions it
is sometimes required to count the number of
processes that already voted. For example, in [FMT91]
a consensus is reached after some process observes
that a counter has been incremented by more than half of
the processes.¹

It is easy to implement a concurrent counter
(mod m) using one register of size log m bits, whose
values are {0, ..., m-1}. To increment or read
the counter a process first locks the access to the
counter, modifies or gets its value, and then unlocks
the access to the counter. This approach ensures
the correctness of the implementation but causes
bottlenecks when there is high memory contention,
since all increment and look operations must be
serialized. Moreover, it is necessary to implement
a lock operation on a large - log m bits - register,
and in the presence of failures access to the counter
may get blocked.
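The lock-based approach just described might look as follows in Python. This is a minimal sketch for illustration only: the class and helper names are hypothetical, and a `threading.Lock` stands in for whatever locking primitive the shared memory provides.

```python
import threading


class LockedCounter:
    """Counter mod m in which every operation is serialized by one lock."""

    def __init__(self, m, initial=0):
        self._m = m
        self._value = initial % m
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # lock, modify, unlock
            self._value = (self._value + 1) % self._m

    def look(self):
        with self._lock:  # lock, read, unlock
            return self._value


def run_threads(counter, n_threads, n_incs):
    """Let n_threads threads each perform n_incs increments."""
    def worker():
        for _ in range(n_incs):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    c = LockedCounter(1024)
    run_threads(c, 8, 1000)
    print(c.look())
```

The sketch is correct but shows both drawbacks named above: all operations are serialized, and a thread that stops while holding the lock blocks every other thread, so the counter is not wait-free.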

Our goal is to design solutions to the concurrent
counter problem that achieve a high level of
concurrency, and thus are also likely to reduce memory
contention; require only weak atomicity and thus
are easy to implement; do not depend on the initial
state of the memory and hence are more robust to
memory changes; and are wait-free - one process
cannot prevent another process from finishing its
increment or look operations - and thus can
tolerate any number of process failures. Finally, and
most important, we want our solutions to use as
little shared space as possible.

¹ The model in [FMT91, FMRT90] assumes only one
shared register. Some of the solutions can use our
implementation of dynamic counters.

1.2  Computational  Model 

The model consists of a fully asynchronous
collection of identical anonymous deterministic processes
that communicate via bounded size shared
registers which are initially in an arbitrary unknown
state. The registers that are used are binary
registers unless otherwise indicated. Also, unless
otherwise stated, it is assumed that access to a shared
register is via atomic "read-modify-write"
instructions, each of which, in a single indivisible step, reads the
value of the register and then writes a new value
that can depend on the value just read.

We  note  that  implementing  a  read-modify-write 
on  a  single  bit  can  be  done  in  one  time  unit  by  a 
simple,  one-input  one-output,  logical  gate,  and  is 
in  general  much  easier  than  on  a  register  of  larger 
size. 

In  some  cases  we  develop  protocols  under  the 
assumption  that  only  one  process  can  increment 
the  counter,  and  in  such  cases  we  assume  only 
read/write  atomicity.  That  is,  in  one  atomic  in¬ 
struction  a  process  may  either  read  from  or  write 
to  a  register.  When  two  or  more  identical  pro¬ 
cesses  may  increment  the  counter,  the  read-modify- 
write  atomicity  cannot  be  replaced  by  the  weaker 
read/ write  atomicity. 

We  assume  that  any  number  of  processes  can  fail. 
The  only  kind  of  failures  we  consider  are  crash  fail¬ 
ures,  in  which  a  process  may  become  faulty  at  any 
time  during  its  execution,  and  when  it  fails,  it  sim¬ 
ply  stops  participating  in  the  protocol. 

Assuming  an  arbitrary  unknown  initial  state  re¬ 
lates  to  the  notion  of  self-stabilizing  systems  de¬ 
fined  by  Dijkstra  [Dij74].  However,  Dijkstra  con¬ 
siders  only  non-terminating  control  problems  such 
as  the  mutual  exclusion  problem,  whereas  our  im¬ 
plementations  of  counters  can  also  be  used  to  solve 
decision  problems  such  as  the  wakeup  and  consen¬ 
sus  problems,  in  which  a  process  makes  an  irrevo¬ 
cable  decision  after  a  finite  number  of  steps. 

It seems that this model more accurately reflects
reality, where in many cases all processes are
programmed alike, there is no global synchronization,
and it is not possible to simultaneously reset all
parts of the system to a known initial state. Our
model is similar to the shared memory model
studied in [FMRT90], except that in [FMRT90] it is
assumed that there is a single finite sized shared
register and that access to the shared register is
via an atomic read-modify-write instruction.

1.3  Summary  of  Results 

We  present  lower  and  upper  bounds  on  the  num¬ 
ber  of  shared  bits  required  to  implement  static 
and  dynamic  counters  as  a  function  of  the  num¬ 
ber  of  processes  that  are  allowed  to  increment  the 
counter.  In  some  cases  we  present  implementations 
that  are  very  efficient  in  both  space  and  time,  while 
in  other  cases  we  show  that  any  implementation 
must  be  very  inefficient.  All  the  results  are  stated 
under  the  assumption  that  the  basic  atomic  op¬ 
erations  are  performed  on  single  binary  registers 
(bits).  These  results  can  be  improved  (in  terms  of 
the  space  used)  when  larger  registers  are  available 
(for  details  see  later).  Since  the  counter  is  a  collec¬ 
tion  of  bits  which  can  be  tested  separately,  we  will 
assume  in  our  upper  bounds  that  m  is  a  power  of  2. 
Finally, the notion of a static (dynamic) n-counter
protocol means that no more than n processes are
allowed to increment the counter. We may let n be
infinity (∞), which means that there is no
restriction on the number of processes that can increment
the counter. We will also use the notion of an
ℓ-bounded counter protocol, which is a protocol where
a total of at most ℓ increments are performed in any
of its runs. We give a brief overview of our results
below.

Optimal Static ∞-Counters: We present a
static ∞-counter protocol that uses only log m bits.
That is, in the protocol there is no bound on the
number of processes that are allowed to increment
the counter, and it matches the trivial log m bits
lower bound.

Optimal Dynamic 1-Counters: In the case
when only one process is allowed to increment the
counter, we are able to construct a protocol that
uses only log m bits, and hence matches the
trivial lower bound. Thus, our protocol gives an
optimal solution to Lamport's global clock problem.
In designing the protocol we use, in a new way,
reflected binary Gray code. Consequently,
incrementing the counter requires a single write
operation, and hence, unlike other implementations
studied in the literature, it consists of a single
atomic step in any model that supports read/write
atomicity on single bits.

Dynamic Counters: We present an m-bounded
dynamic ∞-counter protocol which uses exactly m
bits. Then we use it to construct an (unbounded)
dynamic n-counter which uses a(log m + 1) shared
registers, where a is the smallest power of 2 that is
not smaller than n.

Lower  bound:  Let  *:  =  min(^2M^ 

We prove that any ℓ-bounded dynamic n-counter
protocol  must  use  at  least  k  registers.  This  result 
holds  even  when  the  processes  have  unique  iden¬ 
tifiers  and  there  is  only  one  possible  initial  state. 
Furthermore,  by  making  various  restrictions  on  the 
way  processes  may  increment  the  counter  we  are 
able  to  tighten  these  bounds. 

1.4  Related  Work 

In  [Lam90],  Lamport  developed  algorithms  to  im¬ 
plement  both  monotonic  and  cyclic  multiple-word 
clocks  that  are  updated  by  one  process  and  read 
by  one  or  more  processes.  Lamport’s  cyclic  clock 
problem  is  a  special  case  of  the  concurrent  (dy¬ 
namic)  counter  problem  where  there  is  only  one 
process  that  can  increment  the  counter  (i.e.,  p  =  1). 
His solution uses 2 log m + 2 registers, and relies on
the assumption of a known initial value.

Aspnes, Herlihy and Shavit [AHS91] introduced
a fundamental new class of networks, called
counting networks. They used counting networks to
construct various objects such as a shared counter,
which is an object that can issue the numbers
0 to m-1 in response to m requests by
processes. Counting networks can be viewed as
objects which support one atomic operation, which
consists of both increment and look. It seems
that counting networks cannot support a look
operation (without incrementing the counter). The
two constructions of counting networks in [AHS91]
require O(m log^2 m) binary registers. Our problem
seems to be related to but different from counting
networks: (1) As mentioned above, it is not clear
whether counting networks can support a look
operation without incrementing the counter, while we
implement a look operation in our solutions. (2)
All implementations of counting networks rely on
the assumption of a single initial state, that is,
all shared registers require initialization, while we
require that a solution to the concurrent counter
problem work for any possible initial state.

In [FMRT90], Fischer, Moran, Rudich and
Taubenfeld investigated a deceptively simple
problem called the wakeup problem. The goal is to
design a protocol for n asynchronous identical
processes in a shared memory environment such that
at least one process eventually learns that at least
r processes have woken up and begun participating
in the protocol. All the solutions
for that problem, except one, use some
implementation of a counter in order to count the awake
processes. There is one solution, however, the
see-saw protocol, where the counting is done
in the local memory of the processes and which requires
only two bits of shared memory, used for
communication. The see-saw protocol cannot
tolerate even one faulty process, and it seems that the
approach taken in designing it cannot be adopted
to solve the concurrent counter problem.

As mentioned earlier, when two or more
identical processes may increment the counter, the
read-modify-write atomicity assumed in this paper
cannot be replaced by the weaker read/write
atomicity. However, it is shown in [AH90, AH90, AG91,
Lam77], and follows from the algorithm in
Subsection 4.1, that it is possible to implement a counter
using only read/write atomicity when the
processes have unique identifiers. In these
implementations the basic correctness condition is
linearizability. That is, although operations of concurrent
processes may overlap, the implementation should provide the
illusion that each counter operation is atomic, while
preserving the order in which operations that do
not overlap happen.

In the full paper we also cover the notion of a
serial counter, which intuitively is a dynamic counter
in which the executions of the increment and look
operations are linearizable. The static, dynamic,
and serial counters bear some similarity to the
notions of safe, regular and atomic registers defined
by Lamport in [Lam86]. In a safe register, it is
assumed only that a read not concurrent with any
writes obtains the correct value. A regular
register is a safe register in which a read that overlaps
a write obtains either the old or new value. An
atomic register is a safe register in which the reads
and writes behave as if they occur in some linear
order.

2  An  Optimal  Static  ∞-Counter

In this section we present a static ∞-counter
protocol that uses only log m bits. That is, in the
protocol there is no bound on the number of
processes that are allowed to increment the counter,
and it matches the trivial log m bits lower bound.

Theorem 2.1 There is a static ∞-counter
protocol that uses log m registers.

In order to prove the theorem we describe the
Positional Protocol. In this protocol a process may
change the value of several registers during a single
increment operation. In Section 5, we show that
any optimal static ∞-counter must allow processes
to change the value of more than one register
during a single increment operation, and that there is
no dynamic ∞-counter protocol that achieves the
same log m space complexity. We point out that
the protocol is not even a dynamic 1-counter
protocol.

In the Positional Protocol the contents of the
k = log m shared registers r_{k-1}, ..., r_0 are viewed
as a binary representation of the value of the
counter. The increment operation is performed
by the straightforward (sequential) algorithm for
incrementing a binary number. That is: scan
the registers from right to left (starting with r_0);
when scanning register r_i, do the following: (1)
flip r_i, and (2) if the value of r_i was 1 before it was
flipped and i < k-1, then repeat this operation on
register r_{i+1}, else terminate the increment
operation. The look operation is performed by simply
reading the contents of the registers.
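The Positional Protocol can be simulated with a short Python sketch. This is an illustration under stated assumptions, not the authors' code: each generator step models one atomic read-modify-write (a flip that also returns the old bit), a random scheduler interleaves the increment operations, and the names `increment_steps`, `look` and `run_interleaved` are hypothetical.

```python
import random


def increment_steps(regs):
    """One increment; each `yield` follows one atomic flip of a register."""
    i, k = 0, len(regs)
    while True:
        old = regs[i]
        regs[i] = 1 - old  # atomic read-modify-write: flip r_i
        yield
        if old == 1 and i < k - 1:
            i += 1  # carry: repeat on r_{i+1}
        else:
            return


def look(regs):
    """Read the registers as a binary number (r_0 is the low-order bit)."""
    return sum(bit << i for i, bit in enumerate(regs))


def run_interleaved(initial, n_incs, k, seed=0):
    """Run n_incs increments with randomly interleaved atomic steps."""
    rng = random.Random(seed)
    regs = [(initial >> i) & 1 for i in range(k)]
    active = [increment_steps(regs) for _ in range(n_incs)]
    while active:
        op = rng.choice(active)
        try:
            next(op)
        except StopIteration:
            active.remove(op)
    return look(regs)


if __name__ == "__main__":
    print(run_interleaved(initial=5, n_incs=7, k=4, seed=1))
```

Once all increments have finished, the value read is (initial + n_incs) mod 2^k for every interleaving, which is the static-counter guarantee. A look taken while increments are still in progress may see a partially propagated carry.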

The correctness proof of this simple
implementation is somewhat complicated by the fact that
several increment operations may take place
simultaneously. The proof is based on showing that
in any execution in which k complete increment
operations are performed (for arbitrary k) and no
other increment is initiated, the number of times
each register is changed depends only on the initial
content of the shared registers and on k, regardless
of the order in which the registers were accessed by
the various processes.
The ideas can be generalized to work for an
arbitrary m. Let us write m's prime factorization
as $m = \prod_{i=1}^{\ell} v_i^{e_i}$, where i ≠ j implies v_i ≠ v_j and
for all i, v_i is prime. Then the Positional Protocol
can be easily modified to work modulo m, using e_i
v_i-valued registers, for all 1 ≤ i ≤ ℓ. Also, using
residue number systems we can get an even faster
solution. For a given m we can look at the prime
factorization of m, and then use powers of primes as
small counters for constructing a counter mod m.
The properties of residue number systems allow us
to work on the smaller counters concurrently.
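The residue-number-system idea can be sketched in Python as follows. This is a sequential illustration only: each small counter is held as a plain integer rather than as a set of shared registers, the Chinese Remainder Theorem is used to recover the value, and all function names are hypothetical. It requires Python 3.8 or later for the modular inverse `pow(x, -1, q)`.

```python
def prime_power_factors(m):
    """Return the prime powers v_i**e_i whose product is m."""
    factors = []
    p = 2
    while p * p <= m:
        if m % p == 0:
            q = 1
            while m % p == 0:
                m //= p
                q *= p
            factors.append(q)
        p += 1
    if m > 1:
        factors.append(m)
    return factors


def increment(residues, moduli):
    """Increment every small counter; the updates are independent."""
    return [(r + 1) % q for r, q in zip(residues, moduli)]


def look(residues, moduli):
    """Recover the counter value mod m by the Chinese Remainder Theorem."""
    total = 1
    for q in moduli:
        total *= q
    x = 0
    for r, q in zip(residues, moduli):
        rest = total // q
        x += r * rest * pow(rest, -1, q)  # modular inverse of rest mod q
    return x % total


if __name__ == "__main__":
    moduli = prime_power_factors(12)  # small counters mod 4 and mod 3
    residues = [0] * len(moduli)
    for _ in range(7):
        residues = increment(residues, moduli)
    print(moduli, residues, look(residues, moduli))
```

Because each small counter is updated independently of the others, the per-modulus increments could proceed concurrently, which is the speed-up the text alludes to.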

3  An  Optimal  Dynamic  1-Counter

In the previous section we presented a space-optimal
static ∞-counter protocol. The situation with
dynamic counters is considerably more involved,
and in the general case dynamic counters will
require much more space than static ones. One
specific case where we were able to design a dynamic
counter that matches the trivial log m bits lower
bound is the case where only one process can
increment the counter.

Theorem 3.1 There is a dynamic 1-counter
protocol that uses log m shared registers.

To prove the theorem we present a protocol where
each increment operation changes the value of
exactly one bit, and a look operation never changes
the value of a bit. The motivation for this
protocol is a lemma in which we prove that
in any dynamic 1-counter protocol that uses log m
bits, a process must change the value of exactly
one bit during a single increment operation, and
no bit can be flipped during a look operation. We
call such a protocol a 1-flip counter. It is
interesting to note that dynamic 1-flip counters are, in
fact, serial counters, since we can order the look
and increment operations of each execution in a
complete order, which is consistent with the partial
order defined by that execution, as follows:

1.  First  order  the  increment  operations  accord¬ 
ing  to  the  order  of  the  read-modify-write  in¬ 
struction  that  flipped  the  value  of  a  register. 
Note  that  since  this  is  a  1-flip  protocol,  this 
order  is  well  defined. 


2.  Then,  order  each  look  operation  which  over¬ 
laps  at  least  one  increment  operation,  after 
one  of  the  increment  operations  which  over¬ 
lap  it,  where  the  value  of  the  counter  imme¬ 
diately  after  that  increment  operation  equals 
the  value  returned  by  the  look  operation. 
After  this  step,  all  the  increment  operations 
are  ordered,  and  all  the  look  operations  are 
ordered  relative  to  the  increment  operations. 

3.  Finally,  complete  the  partial  order  of  look  op¬ 
erations  that  appear  between  two  consecutive 
increment  operations  to  a  complete  order  in 
an  arbitrary  way. 

3.1  Preliminaries 

For many years, Gray code has been used for counting
[Gar72, Gil58, Gra53, Koh70]. Increments are
done, by only one incrementor, according to the
code, while a look operation simply reads the
counter digits and converts them to the appropriate
number. This technique works only if a look operation
is much faster than an increment operation.
If even a few increment operations are
concurrent with a look operation, the look might
return a wrong answer. In our framework we do
not assume anything about the relative speeds of
increment and look operations, hence the above
naive use of Gray code does not solve our problem.

Reflected binary Gray code (abbreviated Gray code)
is a well-known method to order all binary words
of any given length k in a cyclic order, such that
two successive words differ in exactly one bit. It
is called reflected code because it can be generated
by the following simple algorithm. Start with 0, 1
as a one-digit Gray code, then reflect and append
the digits to get 0, 1, 1, 0. Next put 0's in front of
the first two numbers and 1's in front of the last
two numbers. The result is a two-digit Gray code
00, 01, 11, 10. To extend an i-digit Gray code to an
(i + 1)-digit code, reflect the i-digit code and,
as before, put 0's in front of the first half of these
numbers and 1's in front of the last half. Note that
the resulting code is cyclic in that the first and
last numbers also differ at only one position. Let
$G_k = (g_0, g_1, \ldots, g_{m-1})$ be all the binary words of
length $k = \log m$ ordered by Gray code (where $g_0$
is the all-zero word). Let gray be the 1-1 mapping
defined by $gray(g_i) = i$. Our protocol represents
each integer $i \in \{0, \ldots, m-1\}$ by $g_i = gray^{-1}(i)$.

An important known property is that the mapping
gray and its inverse $gray^{-1}$ are computable
by an on-line linear time algorithm. To convert a
standard binary number to its reflected Gray equivalent,
start with the digit at the right and consider
each digit in turn. If the next digit to the left is
0, let the former digit stand. If the next digit to
the left is 1, change the former digit. The leftmost
digit is assumed to have 0 on its left and therefore
remains unchanged. To convert back again, consider
each digit in turn starting at the right. If the
parity (sum) of all digits to the left is even, let the
digit stay as it is. If the parity is odd, change the
digit.
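
The reflection construction and the two digit-by-digit conversions can be sketched in Python as follows. This is only an illustration, not the protocol code (which the abstract omits): the function names are hypothetical, words are strings with the leftmost digit first, and the conversions scan from the left, which gives the same result as the right-to-left description because each output digit depends only on input digits.

```python
def reflected_gray_code(k):
    """Return G_k, the list of all k-bit words in reflected Gray order (k >= 1)."""
    words = ["0", "1"]
    for _ in range(k - 1):
        # Reflect the list; prefix the original half with 0 and the mirrored half with 1.
        words = ["0" + w for w in words] + ["1" + w for w in reversed(words)]
    return words


def binary_to_gray(bits):
    """Standard binary word -> Gray word (hypothetical helper name)."""
    out = []
    left = "0"  # the leftmost digit is assumed to have 0 on its left
    for b in bits:
        # Keep the digit if its left neighbour is 0, change it if the neighbour is 1.
        out.append(b if left == "0" else ("1" if b == "0" else "0"))
        left = b
    return "".join(out)


def gray_to_binary(word):
    """Gray word -> standard binary word; int(result, 2) is its position in G_k."""
    out = []
    parity = 0  # parity of the Gray digits strictly to the left of the current one
    for g in word:
        out.append(str(int(g) ^ parity))  # change the digit iff that parity is odd
        parity ^= int(g)
    return "".join(out)
```

With these helpers, `int(gray_to_binary(g), 2)` plays the role of gray(g), and `binary_to_gray` applied to the k-bit binary form of i yields $g_i$.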

There is also a known on-line linear time algorithm
to find the successor of a word in $G_k$. For
each k-bit word $v = [v_{k-1} \cdots v_0]$, define the critical
index of v to be the minimal index j in $\{0, \ldots, k-1\}$
such that the (k − j)-bit prefix of v, $[v_{k-1} \cdots v_j]$,
has even parity, and if there is no such index (i.e.,
$v = [10 \cdots 0]$) then $j = k - 1$. Then the successor
of v in $G_k$ is obtained from v by flipping the bit $v_j$,
where j is the critical index of v.
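
As a small illustration of the critical-index rule, the following Python sketch computes the successor of a word in the cyclic Gray order. The names `critical_index` and `successor` are hypothetical, and the quadratic prefix-parity loop is chosen for clarity rather than for the on-line linear running time mentioned above.

```python
def critical_index(v):
    """v is a string v[k-1] ... v[0]; the leftmost character is bit k-1."""
    k = len(v)
    for j in range(k):
        prefix = v[: k - j]  # the k-j leftmost bits [v_{k-1} ... v_j]
        if prefix.count("1") % 2 == 0:
            return j
    return k - 1  # no even-parity prefix: v = 10...0


def successor(v):
    """Flip the bit at the critical index to obtain the next word in G_k."""
    k = len(v)
    pos = k - 1 - critical_index(v)  # string position of bit v_j
    flipped = "1" if v[pos] == "0" else "0"
    return v[:pos] + flipped + v[pos + 1:]
```

Starting from the all-zero word and applying `successor` repeatedly walks through the whole code and wraps around to the all-zero word.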

In the full version we prove the following properties,
which are used in the construction of the protocol.
For any word w where $|w| = i < k$, all the k-bit
words whose prefix is w appear consecutively
in $G_k$. Denote by first(w) the first word in $G_k$
that has w as a prefix. Similarly, denote by last(w)
the last word in $G_k$ which has w as a prefix. Finally,
middle(w) is the unique word v for which
$gray(v) = \lfloor (gray(first(w)) + gray(last(w)))/2 \rfloor$. Let par(w) be
the parity of w. Then $first(w) = w \cdot par(w) \cdot 0^{k-i-1}$
(i.e., w followed by the parity bit of w followed by
k − i − 1 zeroes), $last(w) = w \cdot \neg par(w) \cdot 0^{k-i-1}$,
and $middle(w) = w \cdot par(w) \cdot 1 \cdot 0^{k-i-2}$.
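
The closed forms for first, last and middle can be checked by brute force for small k. The Python sketch below is an illustration under stated assumptions: all function names are hypothetical, `gray` uses the standard shift-and-xor Gray decoding, and middle is only evaluated for prefixes of length at most k − 2, where its closed form has a non-negative number of trailing zeroes. The check treats middle(w) as the word halfway (rounded down) between first(w) and last(w) in Gray order.

```python
from itertools import product


def par(w):
    return w.count("1") % 2


def gray(word):
    """Position of a Gray word in G_k (standard Gray-to-binary decoding)."""
    n, value = int(word, 2), 0
    while n:
        value ^= n
        n >>= 1
    return value


def first(w, k):
    return w + str(par(w)) + "0" * (k - len(w) - 1)


def last(w, k):
    return w + str(1 - par(w)) + "0" * (k - len(w) - 1)


def middle(w, k):
    # Only meaningful when len(w) <= k - 2.
    return w + str(par(w)) + "1" + "0" * (k - len(w) - 2)


def check_prefix_properties(k):
    """Verify the stated properties for every prefix w with |w| < k."""
    words = sorted(("".join(b) for b in product("01", repeat=k)), key=gray)
    for i in range(k):
        for bits in product("01", repeat=i):
            w = "".join(bits)
            block = [gray(v) for v in words if v.startswith(w)]
            # Words with prefix w occupy consecutive positions of G_k.
            assert block == list(range(block[0], block[-1] + 1))
            assert gray(first(w, k)) == block[0]
            assert gray(last(w, k)) == block[-1]
            if i <= k - 2:
                assert gray(middle(w, k)) == (block[0] + block[-1]) // 2
    return True
```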

3.2  The  Gray  code  Counter 

We can now give an informal description of the
Gray code protocol. The code of the protocol is
omitted from this abstract, as is its proof,
which is rather involved. For the rest of this section
m and $k = \log m$ are fixed, and k is the number of
registers used for the counter.

The increment operation is performed by reading
the content $v = [v_{k-1} \cdots v_0]$ of the counter registers,
and flipping $v_j$, where j is the critical index
of v described above. We notice that since there is
only one process that may increment the counter,
it has to read the counter only one time, at the
beginning of the protocol.

The protocol for look uses a function called 4-way
scan whose purpose is to take a snapshot of
the content of the counter registers, and to return
a word $g_i$ such that i is a valid value of the
counter. This function is described next. It first
reads the registers from right to left and gets a
word $a = [a_{k-1} \cdots a_0]$. That is, $a_0$ is read first and
$a_{k-1}$ is read last. Then it reads the registers from
left to right and gets a word $b = [b_{k-1} \cdots b_0]$ (this
time $b_{k-1}$ is read first). Then, again, it reads
the registers from right to left and gets a word
$c = [c_{k-1} \cdots c_0]$, and finally it reads the registers
from left to right and gets a word $d = [d_{k-1} \cdots d_0]$.

Let w be the maximal common prefix of the
words a, b, c and d, and let $|w| = i$. In case i < k,
let $x_1, x_2, x_3$ and $x_4$ be the (i + 1)-st bit of a, b, c and d
respectively. (I.e., $w \cdot x_1$ is a prefix of a.) The function
checks the following conditions sequentially:

1. if a = b then return a;

2. elseif c = d then return c;

3. elseif $|w| = 0$ then return $last([a_{k-1}])$;

4. elseif $x_1 = \neg par(w)$ and $x_2 = x_3 = x_4 = par(w)$ then return first(w);

5. elseif $x_1 = x_2 = x_3 = \neg par(w)$ and $x_4 = par(w)$ then return last(w);

6. elseif $x_1 = x_2 = \neg par(w)$ and $x_3 = x_4 = par(w)$ then return last(w);

7. elseif $x_1 = x_3 = \neg par(w)$ and $x_2 = x_4 = par(w)$ then return last(w);

8. elseif all the above conditions fail then return
middle(w).

Using the 4-way scan, it is now easy to describe
the implementation of the look operation. The
look operation calls the 4-way scan function, which
returns a k-bit word g. The output of the look
operation is gray(g).
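
A possible rendering of the look operation in Python is sketched below. It is not the omitted protocol code: every name (`scan`, `decide`, `look`, and the `read(j)` callback that returns the current bit of register j) is an assumption made for illustration, the eight conditions are transcribed as they read in the list above, and the behaviour of case 8 when the common prefix has length k − 1 is not specified by the text and is left unhandled here.

```python
def par(w):
    return w.count("1") % 2


def gray(word):
    """Position of a Gray word in G_k (standard Gray decoding)."""
    n, value = int(word, 2), 0
    while n:
        value ^= n
        n >>= 1
    return value


def first(w, k):
    return w + str(par(w)) + "0" * (k - len(w) - 1)


def last(w, k):
    return w + str(1 - par(w)) + "0" * (k - len(w) - 1)


def middle(w, k):
    return w + str(par(w)) + "1" + "0" * (k - len(w) - 2)  # assumes len(w) <= k - 2


def common_prefix(words):
    n = 0
    while n < len(words[0]) and all(u[n] == words[0][n] for u in words):
        n += 1
    return words[0][:n]


def decide(a, b, c, d):
    """Apply conditions 1-8 to the four words collected by the scan."""
    k = len(a)
    if a == b:
        return a
    if c == d:
        return c
    w = common_prefix([a, b, c, d])
    if len(w) == 0:
        return last(a[0], k)  # last of the one-bit word [a_{k-1}]
    i = len(w)
    xs = tuple(int(u[i]) for u in (a, b, c, d))
    p, q = par(w), 1 - par(w)
    if xs == (q, p, p, p):
        return first(w, k)
    if xs in ((q, q, q, p), (q, q, p, p), (q, p, q, p)):
        return last(w, k)
    return middle(w, k)


def scan(read, k, right_to_left):
    """Read the k registers one by one; return the word [r_{k-1} ... r_0]."""
    order = range(k) if right_to_left else range(k - 1, -1, -1)
    bits = {j: read(j) for j in order}  # dicts preserve the read order
    return "".join(str(bits[j]) for j in range(k - 1, -1, -1))


def look(read, k):
    a = scan(read, k, True)
    b = scan(read, k, False)
    c = scan(read, k, True)
    d = scan(read, k, False)
    return gray(decide(a, b, c, d))
```

When no increment overlaps the look, all four scans agree and the function simply decodes the registers; the remaining cases only matter under concurrency.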

4  Dynamic  Counters  for  Many 
Processes 

In this section we present two dynamic counters.
The first works when the number of increment
operations is bounded; the other works when
the number of processes is bounded and known,
and the number of increment operations is unbounded.

4.1  The  Cyclic-flip  Counter 

In this subsection we present an m-bounded dynamic
∞-counter protocol, that is, a protocol that
works as long as no more than a total of m increments
are performed. Counters of this type are
used (implicitly) in wakeup and consensus protocols.

Theorem 4.1 There is an m-bounded dynamic
∞-counter protocol that uses m registers.

The protocol presented here, called the Cyclic-flip
protocol, uses m registers, which is exponentially
more than required by the dynamic 1-counter of
the previous section. In the next section we show
that an exponential gap in the number of registers
is unavoidable, even if it is assumed that all the
registers are initialized and the processes are allowed
to be distinct.


4.1.1  The  Counting  Method 

Like the Positional and the Gray code counters,
the Cyclic-flip counter is based on a mapping from
binary words onto $\{0, 1, \ldots, m-1\}$, where m is a
power of two. However, this time the domain of the
mapping is the words of length m, and not of length
log m as in the previous cases. The mapping, called
number, is defined as follows.

Let w be a binary word of length m (where m is
a power of 2), and let par(w) be the parity of w.
When m > 1, let $w = w_2 \cdot w_1$ where $|w_2| = |w_1| = |w|/2$.
Then,

$$number(w) = \begin{cases} 0 & \text{if } |w| = 1; \\ 2 \cdot number(w_2) + par(w) & \text{otherwise.} \end{cases}$$

That is, $number(w) = \sum_{i=0}^{\log m - 1} v(i) \cdot 2^i$, where
$v(i) = par(w(m-1) \cdots w(m - m/2^i))$.
($v(i)$ is the parity of the $m/2^i$ leftmost bits of w.)
Thus, for m = 8, $number(10110001) = 2^1 + 2^2 = 6$,
since 10 and 1011 have parity 1, and 10110001 has parity 0.
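
The recursive definition and its closed form can be compared directly. The short Python sketch below is illustrative only; the names `number` and `number_as_sum` are hypothetical, and words are strings with the leftmost bit first.

```python
def par(w):
    return w.count("1") % 2


def number(w):
    """Recursive definition: 0 for a one-bit word, else 2*number(w2) + par(w)."""
    if len(w) == 1:
        return 0
    w2 = w[: len(w) // 2]  # left half of w
    return 2 * number(w2) + par(w)


def number_as_sum(w):
    """Closed form: bit i of the value is the parity of the m/2^i leftmost bits."""
    total, i, size = 0, 0, len(w)
    while size > 1:
        total += par(w[:size]) << i
        size //= 2
        i += 1
    return total
```

For the word 10110001 used in the text, both functions return 6.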


Given a word w, the increment operation is
done by finding a word w' such that number(w') =
number(w) + 1. For this we use the function next,
which we define below.

Let w be a binary word of length $m = 2^k$. When
m > 1, let $w = w_2 \cdot w_1$ where $|w_1| = |w_2| = |w|/2$.
Then,

$$next(w) = \begin{cases} [0] & \text{if } w = [1]; \\ [1] & \text{if } w = [0]; \\ w_2 \cdot next(w_1) & \text{if } par(w_1) = par(w_2); \\ next(w_2) \cdot w_1 & \text{if } par(w_1) \neq par(w_2). \end{cases}$$

We define $next^i(w)$ recursively as follows:
$next^0(w) = w$ and $next^i(w) = next(next^{i-1}(w))$,
for i > 0.


Lemma 4.1 Let w be a binary word of length m.
Then, $next^i(w)$ and w differ by exactly i bits, for
any $0 \le i \le m$.

Lemma 4.1 implies that the next function partitions
the set of all binary words of length m into $2^m/(2m)$
cycles of length 2m each. Each such cycle consists
of 2m words $w_0, w_1, \ldots, w_{2m-1}$, where $next(w_i) =
w_{i+1 \pmod{2m}}$. Furthermore, $w_{i+1 \pmod{2m}}$ is obtained
from $w_i$ by flipping one bit, and in any m
successive applications of next, each bit is flipped
exactly once.

The  following  lemma  implies  that  when  the  value 
of  the  counter  is  given  by  the  function  number, 
the  next  function  can  be  used  for  implementing  the 
increment  operation. 

Lemma 4.2 Let w be a binary word of length
m. Then, $number(next(w)) = number(w) + 1 \pmod{m}$.
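
Both lemmas can be checked exhaustively for small m. The Python sketch below is an illustration rather than the paper's code: the names `next_word`, `hamming` and `check_lemmas` are hypothetical, and the recursion follows the reading of the definition in which the left half $w_2$ is advanced when the two halves have different parities.

```python
from itertools import product


def par(w):
    return w.count("1") % 2


def number(w):
    if len(w) == 1:
        return 0
    return 2 * number(w[: len(w) // 2]) + par(w)


def next_word(w):
    """One application of next: flips exactly one bit of w."""
    if len(w) == 1:
        return "1" if w == "0" else "0"
    half = len(w) // 2
    w2, w1 = w[:half], w[half:]
    if par(w1) == par(w2):
        return w2 + next_word(w1)
    return next_word(w2) + w1


def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))


def check_lemmas(m):
    """Check Lemma 4.1, Lemma 4.2 and the cycle length 2m for all words of length m."""
    for bits in product("01", repeat=m):
        w = "".join(bits)
        assert number(next_word(w)) == (number(w) + 1) % m  # Lemma 4.2
        cur = w
        for i in range(m + 1):
            assert hamming(cur, w) == i  # Lemma 4.1
            cur = next_word(cur)
        # cur is now next^{m+1}(w); m - 1 more steps complete the cycle of length 2m.
        for _ in range(m - 1):
            cur = next_word(cur)
        assert cur == w
    return True
```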


4.1.2  The  Protocol 

We are now ready to give a description of the
Cyclic-flip protocol. The protocol uses m registers.
The value of the counter is the non-negative integer
we get when applying the number function to the
content of the m shared registers $r_{m-1}, \ldots, r_0$. The
code for the Cyclic-flip protocol is omitted from
this abstract. Below we give a description of its
operations.

The look operation calls the double-scan function,
which reads all the m counter registers sequentially
in a cyclic order until it reads the same values
for all the registers two consecutive times, and
outputs this m-bit word. Then the number function
is applied to the m-bit output of the double-scan
function and the result is a number v. The output
of the look operation is the number v.

The increment operation first uses the double-scan
function to get a correct snapshot of the
counter registers. Then, the next function is used
to find which bit has to be flipped in order to increment
the counter. Finally, if nobody else has flipped
this bit yet, then this bit is flipped; otherwise we
start all over again.
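
To make the description concrete, here is a minimal Python sketch of the two operations. It is not the omitted protocol code and rests on several assumptions: the class name `CyclicFlipCounter` and its methods are hypothetical, the registers start at zero, a lock is used only to model a single atomic read-modify-write on one register, and at most m increments are performed in total (the m-bounded setting).

```python
import threading


def par(w):
    return sum(w) % 2


def number(w):
    if len(w) == 1:
        return 0
    return 2 * number(w[: len(w) // 2]) + par(w)


def flip_index(w, offset=0):
    """Position (0 = leftmost) of the bit that next(w) flips."""
    if len(w) == 1:
        return offset
    half = len(w) // 2
    w2, w1 = w[:half], w[half:]
    if par(w1) == par(w2):
        return flip_index(w1, offset + half)
    return flip_index(w2, offset)


class CyclicFlipCounter:
    def __init__(self, m):
        self.regs = [0] * m            # m one-bit shared registers
        self._rmw = threading.Lock()   # models one atomic read-modify-write

    def _collect(self):
        return tuple(self.regs[j] for j in range(len(self.regs)))

    def double_scan(self):
        """Re-read all registers until two consecutive collects agree."""
        prev = self._collect()
        while True:
            cur = self._collect()
            if cur == prev:
                return cur
            prev = cur

    def look(self):
        return number(self.double_scan())

    def increment(self):
        """Flip the bit chosen by next; retry if someone else flipped it first."""
        m = len(self.regs)
        while True:
            w = self.double_scan()
            j = flip_index(w)
            with self._rmw:
                if self.regs[j] == w[j]:
                    self.regs[j] = 1 - w[j]
                    return (number(w) + 1) % m  # the value right after this flip
```

The value returned by `increment` reflects the remark that a process completing an increment knows the counter value at that moment.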

We  point  out  that  the  space  complexity  of  the 
Cyclic-flip  protocol  (almost)  matches  the  m  -  1 
lower  bound  which  follows  from  the  second  theo¬ 
rem  in  the  next  section. 

Notice  that  in  the  Cyclic-flip  counter,  whenever 
a  process  completes  an  increment  operation,  it 
“knows”  the  current  value  of  the  counter.  This 
fact  is  used  in  the  solution  in  the  next  subsection. 

4.2  Dynamic  n-counter  Protocol 

In  this  subsection  we  combine  the  Gray  code  pro¬ 
tocol  and  the  Cyclic-flip  protocol  to  obtain  an  effi¬ 
cient  (unbounded)  dynamic  n-counter  protocol. 

Theorem 4.2 For any n, there is a dynamic n-counter
protocol that uses a(log m + 1) shared registers,
where a is the smallest power of 2 that is
not smaller than n.

Let  a  be  the  smallest  power  of  2  that  is  not  smaller 
than  n.  Each  process  starts  the  execution  by  as¬ 
signing  itself  a  unique  identifier.  This  is  done  us¬ 
ing  the  Cyclic-flip  counter  protocol  described  in  the 
previous  subsection.  We  use  a  registers  to  imple¬ 
ment  such  a  counter.  Each  process  as  it  wakes  up 
increments  this  counter  by  1,  and  assigns  itself  the 
new  value. 

Apart from this counter there are n other counters
numbered 0 through n − 1, each consisting
of log m registers. Once a process assigns itself
an identifier, say i, it considers counter i as
its own local counter that only it can increment
and everybody else can read. A process executes
an increment operation by incrementing its local
counter using the Gray code protocol. A process
executes a look operation using the 4-way scan
operation described in the Gray code protocol, on
each of the n local counters, and then sums up the
results.


5  Lower  Bounds 

In this section we prove lower bounds on the number
of registers needed to implement dynamic counters.
Since, by definition, any serial counter is also
a dynamic counter, all the lower bounds
for dynamic counters hold also for serial counters.
Since, when m = 2, we can easily solve the counter
problem using one shared bit, we assume from now
on that m > 2. All the results that we prove in this
section hold even if it is assumed that look operations
are atomic. Recall that an ℓ-bounded dynamic
n-counter is a dynamic n-counter where a total
of at most ℓ increment operations are performed
in any of its runs. Note that min{log m, log ℓ} is
a trivial lower bound on the number of registers
needed to implement such a counter. The main
result of this section is the following:

Theorem 5.1 Let $k = \min(n/2,\ m/2,\ \sqrt{\ell/3})$.
Any ℓ-bounded dynamic n-counter protocol must
use at least k registers. This bound holds even when
the processes are distinct and there is only one possible
initial state.

Proof: Assume to the contrary that there exists an
ℓ-bounded dynamic n-counter protocol that uses
$k < \min(n/2,\ m/2,\ \sqrt{\ell/3})$ registers. This implies
that n > 2k, m/2 > k, and $\ell > 3k^2$. We show how
this assumption leads to a contradiction.

To make the result stronger we assume that there
is only one possible initial state. To simplify the
presentation we assume that in this single initial
state the value of the counter is zero.

For $b \in \{0, 1\}$, we say that process p is b-loaded
for register r in a run x if the event $rmw_p(r, b, \neg b)$
is enabled at x. That is, if p takes a step next,
it is going to flip the bit in r from b to ¬b. We
notice that by axiom RMW1, if p is b-loaded for
r at x then p is also b-loaded for r at any run that
is indistinguishable to p from x, provided that the
value of r at that run is b. We now construct a run
ρ as follows:

We repeat the following procedure at
most 2k times. Starting with i = 1, we
let process $p_i$ repeatedly increment the
counter until one of the following two situations
happens: (1) $p_i$ becomes the first b-loaded
process for some register r (and in
this case we say that $p_i$ is suspended on r);
or (2) $p_i$ completes an increment operation
and the total number of increment
operations that have begun so far is ℓ. If
(1) happens, and we still have not suspended
2k processes, then we activate
process $p_{i+1}$ according to the above procedure
(and increment i by one). The construction
terminates when either (2) happens
or 2k processes have already been suspended.

We note that for each register at most two processes
may become suspended on it (one may
become 0-loaded and the other may become 1-loaded),
and since n > 2k we always have a process
to activate when needed.

We call an increment operation that was completed
in the run ρ reversible if for any register
that was changed during this operation there were
already two processes suspended on it, and the
increment operation is irreversible otherwise. It
is not difficult to see that at most k increment
operations in ρ are irreversible. This follows from
the fact that every register can be changed at most
once during the run ρ while there is only one process
suspended on it.

Next we show that if 2k processes are suspended
at ρ then we are done. If this is the case, then for
any of the k registers there is a 0-loaded process and
a 1-loaded process suspended on it. This means
that by activating some of these processes we have
full control over the value of the counter.

For the rest of the proof, ⊕ denotes addition
modulo m. Let ρ be such a run, and let $I =
end(\rho) \oplus k$, where end(ρ) is the number of increments
that have been completed in ρ. W.l.o.g. assume
that when the value of the counter is $I \oplus
(k+1)$, the values of the counter registers $r_k, \ldots, r_1$
are $v_k, \ldots, v_1$, respectively; and that in the run ρ, for
each register $r_i$, process $p_i$ is $(\neg v_i)$-loaded for $r_i$ in ρ.
Activate all suspended processes except processes
$p_1, \ldots, p_k$ (in some order) and let them terminate.
Now we still have the k processes $p_1, \ldots, p_k$ suspended, and
we can activate all or a subset of them (depending
on the current values of the counter registers) and
get the counter to equal $I \oplus (k+1)$. That is, we can
extend ρ to a run ρ' where $count(\rho') = I \oplus (k+1)$
and $end(\rho') = I$. Since at most k processes are
suspended in ρ', by D1, count(ρ') must lie in the
cyclic interval $[I, I \oplus k]$. Since it is assumed that
k < m/2, $I \oplus (k+1)$ does not lie in the cyclic
interval $[I, I \oplus k]$, a contradiction.

So, let us assume from now on that at most 2k − 1
processes are suspended at ρ. This implies that the
total number of increment operations that have
begun in ρ is exactly ℓ. Hence, the total number of
(completed) reversible increment operations is at
least ℓ − (2k − 1) − k. We call each maximal interval
of ρ which contains only reversible increment operations
a segment. (I.e., a segment does not contain
any irreversible increment operations or any operations
by suspended processes.) We notice that
two consecutive segments are separated by one or
more irreversible increment operations or operations
by (eventually) suspended processes. Since
there are at most k irreversible increment operations,
and at most 2k − 1 processes are suspended
at ρ, there are at most 3k segments. Since there
are at least ℓ − 3k + 1 reversible increment operations,
$\ell > 3k^2$ and k is an integer, there must be
at least one segment which includes at least k
reversible increment operations.
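
The counting step behind this claim can be written out explicitly. The following is a worked version of the pigeonhole argument, assuming (as the text states) at least $\ell - 3k + 1$ reversible increment operations, at most $3k$ segments, and $\ell > 3k^2$; the symbol $s$ for the number of reversible increments in the largest segment is introduced here only for illustration.

```latex
s \;\ge\; \frac{\ell - 3k + 1}{3k}
  \;>\; \frac{3k^2 - 3k + 1}{3k}
  \;=\; k - 1 + \frac{1}{3k}
  \;>\; k - 1 .
```

Since $s$ is an integer, it follows that $s \ge k$.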

Let x and y be two prefixes of ρ (x < y)
such that (y − x) includes exactly k reversible
increment operations, does not include any irreversible
increment operations, all increment operations
in x and y, except those of processes that
are suspended in x, have been completed, and no
new processes are suspended in (y − x). Two such
runs must exist by the argument in the previous
paragraph.

We  note  that  for  any  register  r  that  is  changed 
during  (y— x)  it  must  be  the  case  that  two  processes 
are  suspended  on  r  at  x.  This  follows  from  the  fact 
that  (y  —  x)  consists  of  only  completed  reversible 
increment  operations. 

By definition of reversible increment operations
we can now activate some of the processes that are
suspended at y to get an extension z of y such
that the values of all the counter registers are the
same in x and z. This implies that count(x) =
count(z). As already mentioned, any of the k reversible
increment operations in (y − x) may only
be involved in changing the value of a register for
which there are exactly two processes suspended on
it. Thus, it is not difficult to observe that, to get
z from y, it is enough to activate at most k − 1 of
the 2k − 1 processes that may be suspended at y.
Also, the runs x and z are indistinguishable to every
process suspended in x that is not involved in
some event in (z − y). We notice that any process
that is suspended in x is also suspended in y (but
not necessarily vice versa). Finally, we extend x
and z to runs x' and z' respectively, in exactly the
same way, by letting all the suspended processes in
x that are not involved in some event in (z − y)
complete their increment operations.

Next we show that at most k − 1 processes are in
the middle of an increment operation in z' (and
also in x'). From the construction, it must be
the case that each of the processes which are still
involved in an increment operation in z' is suspended
in y. Hence it is enough to show that
for each register r, there is at most one process
that was suspended on r in y and is involved in an
increment operation in z'. There are two possible
cases:

1. There are two processes suspended on r in x.
Then at least one of them is not activated in
(z − y), and can be activated in (z' − z).

2. Fewer than two processes are suspended on r in
x. As noted above this implies that r was not
changed in (y − x) and hence no process that
is suspended on r is activated in (z − y). Then,
if a process is suspended on r at x, it can be
activated in (z' − z).

Since at most 2k − 1 processes are suspended at x,
at most k − 1 processes are activated in (y − x).
Thus, at most k − 1 processes are in the middle of
an increment operation in z' (or in x').

From the construction, using the RMW axioms, it
follows that the values of all the counter registers
are the same in x' and z', and thus count(x') =
count(z'). On the other hand, let end(x') be the
number of increments that have been completed in
x'. Since at most k − 1 processes are in the middle
of an increment operation in x', count(x') must
lie in the cyclic interval $[end(x'), end(x') \oplus (k-1)]$.
Also, since $end(z') \ge end(x') + k$ and since
at most k − 1 processes are in the middle of an
increment operation in z', count(z') must lie in
the cyclic interval $[end(x') \oplus k, end(x') \oplus (k + k - 1)]$.
Since k < m/2, it must be that $count(z') \ne
count(x')$, a contradiction. ∎

By  slightly  modifying  the  proof  of  Theorem 
5.1,  we  can  prove  the  following;  Let  k  = 
min(^^^,  2^,  y/i  -I-  4,25  —  1.5).  Any  ^-bounded 
dynamic  n-counter  protocol  must  use  at  least  k 
registers.  (Again,  this  bound  holds  even  when  the 
processes  are  distinct  and  there  is  only  one  possible 
initial  state.) 

It follows from Theorem 5.1 that any dynamic n-counter
protocol must use at least $\min(n/2, m/2)$
bits. Next, by making various restrictions on
the way processes may increment the counter, we
tighten the lower bound. Recall that a protocol is
a 1-flip protocol if a process may change the value
of only one register during a single increment operation.

Theorem 5.2 Let $k = \min(\ell, n)$. Any 1-flip ℓ-bounded
dynamic n-counter protocol must use at
least k − 1 registers.

Proof: Assume to the contrary that k − 2 registers
are sufficient. Consider a run ρ where k − 1 different
processes increment the counter in a sequential
manner, one time each, starting from some arbitrary
initial state. Since there are only k − 2 registers,
the value of at least one register, say r, is
changed at least two times. Let $p_1$ be the process
that was the first to change r and let $p_2$ be the process
that was the second to change r. Let $\rho_2$ be a
prefix of ρ where $p_2$ has just completed its increment
operation, and let $\rho_1$ be a prefix of $\rho_2$ where $p_2$ is
just about to start its increment operation.

Next we construct the run $\rho_3$ as follows. Let $p_3$
be a process that did not participate in $\rho_2$. We
construct $\rho_3$ by first activating the processes exactly
as in $\rho_2$ until the point where $p_1$ is about to
change r for the first time. Then we activate $p_3$
until it is also about to change r. This will happen
since processes $p_1$ and $p_3$ are identical. We then
suspend $p_3$ and let the run continue as in $\rho_2$. Finally,
after $p_2$ changes the value of r for the second
time (and completes its increment operation) we
activate $p_3$ and let it change r for the third time
and complete its increment operation. Notice that
between the point where $p_3$ was suspended and the
point when it was activated again the value of r is
changed exactly two times, and hence $p_3$ will not
notice that r has been changed and will change r
when it is activated again.

Since in $\rho_1$ the register r was changed once and
in $\rho_3$ it was changed three times, the values of all
registers in these two runs are the same and hence
$count(\rho_1) = count(\rho_3)$. However, the number of
increment operations in $\rho_3$ is greater by exactly
two than in $\rho_1$, and since it is assumed that m > 2,
we reach a contradiction. ∎

It follows from Theorem 5.2 that there does not
exist a 1-flip dynamic ∞-counter protocol when using
only binary registers. Theorem 5.2 can easily
be generalized to show that when m > v, any
1-flip ℓ-bounded dynamic n-counter protocol must
use at least (k − 1)/(v − 1) v-valued registers, where
$k = \min(\ell, n)$.

The  implementations  of  the  increment  opera¬ 
tion  in  all  the  protocols  we  present  in  this  pa¬ 
per,  except  the  protocol  in  Section  4.2,  have  the 
property  of  being  independent  of  the  specific  state 
of  the  process  that  executes  them.  In  every  sin¬ 
gle  increment  operation  the  final  content  of  the 
counter  is  determined  only  by  its  initial  content. 
We  call  such  a  counter  protocol  a  memory-less 
counter  protocol.  For  such  protocols,  we  have  a 
stronger  version  of  Theorem  5.2. 

Theorem 5.3 Any 1-flip memory-less ℓ-bounded
dynamic 2-counter protocol must use at least ℓ − 1
registers.

The proof of Theorem 5.3 is similar to that of Theorem
5.2 except that in the construction of the run ρ
in the previous proof, instead of activating k − 1 different
processes, we activate only one process and
let it increment the counter k − 1 times.

We point out that Theorem 5.3 does not contradict
Theorem 4.2, since the protocol described in
Section 4.2 is not a memory-less or a 1-flip protocol.
As in Theorem 5.2, it follows from Theorem
5.3 that there does not exist a 1-flip memory-less
dynamic 2-counter protocol when using only binary
registers. Finally, we notice that the last two theorems
hold also for static counters (the proofs are
exactly the same). This fact does not contradict
Theorem 2.1, since the Positional protocol is not a
1-flip protocol.


6  Discussion 

We  study  a  new  basic  problem  -  the  concurrent 
counter  problem  -  in  a  model  where  no  assump¬ 
tion  is  made  about  the  initial  state  of  the  shared 
memory.  We  design  efficient  protocols  for  solving 
the  problem  and  prove  several  space  lower  bounds. 

Let the time complexity be the total number of accesses
to the shared memory needed to complete
a (look or increment) operation. The time complexity
of both the look and increment operations
in the Positional protocol is log m. In the
Gray code protocol the time complexity of the look
operation is 4 log m, and that of the increment operation
is 1, apart from the first increment, which
takes log m + 1. As for the Cyclic-flip protocol,
in the absence of contention, the time complexity
of the look operation is m, and that of the increment
operation is m + 1. When there is contention, the
complexity can be in the worst case $m^2 + 2m$ for
the look operation and $2m^2 + m$ for the increment
operation. As for the protocol discussed in subsection
4.2, the complexity of the look operation is
n log m, while that of the increment operation
is 1, apart from the first increment operation,
which may take $2m^2 + m + 1$.

There  are  still  many  interesting  open  questions 
related  to  concurrent  counters.  Some  of  these  prob¬ 
lems  are  listed  below. 

The lower bound in Theorem 5.1 is not tight for
ℓ. Improving this lower bound (or the corresponding
upper bound) may also help in improving the
related time bound, and may have implications on
the bounds in [FMRT90].

Generalize the results to counters which use registers
of constant size larger than one bit. For instance,
implement Gray codes over alphabets of size
greater than 2 for optimal dynamic 1-counters. This
seems plausible for alphabets of even size.

Generalize or modify counters to objects that
support a wider variety of operations. For example,
a natural generalization whose implementation
raises non-trivial problems is obtained by extending
the counter definition to allow a decrement operation,
which decreases the value of the counter
by one. Another operation is to reset the counter
to some default value.

Abstract

Two implementations of a multi-writer, multi-reader,
atomic register are presented. The physical registers used
by the first implementation are single-writer,
multi-reader, atomic registers; the physical registers
used by the second implementation are single-reader,
single-writer, atomic registers. Both implementations are
optimal with respect to the two most important complexity
criteria: in both implementations the space complexity is
logarithmic, thus matching the lower bound proven by Cori
and Sopena; and the time complexity is linear, thus
matching the obvious lower bound. These implementations
improve upon the space complexity of all previous
implementations in their respective classes by an
exponential factor.


1  Introduction 

At the most basic level of interprocessor communication,
data is transferred via registers: memory devices which
support read and write operations. Each register can store
any element from its set of permitted values; it has a set
of writers, processors that can write values in the
register, and a set of readers, processors that can read
values from the register. A write operation takes some
permitted value as a parameter and stores it in the
register; a read operation returns a (permitted) value
stored in the register. The stored and returned values
should satisfy some consistency guarantees which depend on
the type of the register. Lamport in [La86a, La86b] was
the first to formalize the notion of a register. He
defined three types of registers in terms of the
consistency guarantees they supply: safe, regular and
atomic. An atomic register supports read and write as
atomic, indivisible operations. Safe and regular registers
make weaker consistency guarantees, which are not
discussed in this paper. The execution of an operation is
called an action. Under the global time model, which we
assume throughout this paper, each action has a starting
time and an ending time. The interval between the starting
and ending times of an action a is called the execution
interval of a. Each writer or reader executes actions in a
serial manner, but we assume absolutely no synchronization
among different processors. Actions of distinct processors
might be executed in overlapping time periods.

In a situation in which some specific type of register is
not available at the local hardware store, one may resort
to implementing the required register using some available
registers. For the user, such an implementation is a black
box which behaves exactly according to its specifications.
Formally, an implementation of a logical register using a
set of physical registers consists of a hardware
arrangement of the physical registers and two programs
that are called a writer protocol and a reader protocol.
Both programs are composed of operations of the physical
registers, called physical operations, and constitute the
operations of the logical register, called logical
operations. The set of processors is partitioned into
logical writers and logical readers, that is, writers and
readers of the logical register. For simplicity we assume
that each processor is either a logical writer or a
logical reader, though in reality a processor can function
as both. Assuming that each physical register supplies its
consistency guarantees, the protocols should satisfy the
consistency guarantees of the logical register. In case
the logical register is atomic, it is possible to assign a
serialization time to each logical action. The
serialization time induces a serialization of all logical
actions in the execution. The atomicity of the implemented
register implies that every read action returns the value
written by the write action which is the most recent
preceding write under the induced serialization. To avoid
trivial solutions it is required that the reader and
writer protocols are wait-free and that each action is
serialized within its execution interval.

Throughout this paper w and r are used for the number of
writers and readers, respectively, and n = w + r is the
total number of processors in the system. A register with
w writers and r readers is denoted as a (w, r)-register.
In [La86a, La86b] Lamport presented five implementations
of various logical registers with a single writer. When
some of these implementations are combined, they form an
implementation of an atomic (1, 1)-register with any value
set, from binary, safe, (1, 1)-registers. Several papers,
motivated by the work of Lamport, studied the intriguing
problem of implementing atomic, multi-writer, multi-reader
registers. The simplest such implementation was presented
by Vitanyi and Awerbuch in [VA86]. They implement an
atomic (w, r)-register using atomic (1, 1)-physical
registers. In this implementation the physical registers
are divided into two fields: a value field, which stores
the value and is not used by the writer and reader
protocols, and a coordination field, which stores all the
information needed for the implementation. In previous
papers the coordination field is called the label field;
we reserve the term label for the most basic coordination
unit and present implementations in which each
coordination field consists of several labels. We call
implementations in which each register is divided into a
value field and a coordination field label-based
implementations. The complexity of a label-based
implementation is measured by several criteria. The two
most important criteria are:

1. Space Complexity - The maximal size of a coordination
   field of any physical register. (This criterion is
   often called label-size.)

2. Time Complexity - The maximal number of physical
   actions executed during a single logical read or write
   operation.

In the [VA86] implementation labels are time-stamps. This
causes its main drawback, namely unboundedness. The actual
size of a label in any logical action is logarithmic in
the number of write actions performed prior to that
action. The time complexity of this implementation is
linear in n, the total number of processors.
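
The logarithmic growth of an unbounded time-stamp label can be illustrated with a small sketch. It is not taken from [VA86]; it only assumes that a label is an integer counter incremented once per logical write, and the helper name `timestamp_bits` is hypothetical.

```python
def timestamp_bits(writes_so_far: int) -> int:
    """Bits needed to store an integer time-stamp after `writes_so_far` writes.

    Assumption (not from the paper): the label is a plain counter that starts
    at 0 and grows by one per logical write, so its size is about log2 of the
    number of writes performed so far.
    """
    return max(1, writes_so_far.bit_length())


if __name__ == "__main__":
    for k in (1, 1_000, 1_000_000):
        print(k, "writes ->", timestamp_bits(k), "bits")
```

Under this assumption the label stays small for any execution of realistic length, but no fixed field width is ever sufficient.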

Several researchers have devised bounded label-based
implementations for atomic, multi-writer, multi-reader
registers, using single-writer, multi-reader physical
registers. The first implementation was proposed in [VA86]
and was found to be erroneous. The second implementation
was proposed by Peterson and Burns in [PB87]; this
implementation has a bug which was discovered and
corrected by Schaffer in [Sc89]. In this implementation
the space complexity is O(w) and the time complexity is
O(w^2). Israeli and Li, in [IL87], suggested bounded
time-stamps as a bounded primitive to capture the temporal
relationship among asynchronous processors. Using this
method they devised an implementation which runs in linear
time and with O(n) space complexity. Another
implementation with an inferior complexity is proposed by
Abraham in [AB91]. The work of Li, Tromp and Vitanyi in
[LTV90] presents an implementation using atomic (1, 1)
physical registers. The space complexity is O(n) and the
time complexity is linear. Since the implementation of
[LTV90] achieves this with inferior physical registers, it
is superior to all aforementioned implementations.

Thus far all proposed bounded concurrent implementations
have a space complexity which is linear in the number of
writers, and in some cases even in the total number of
processors. Some explanation for this phenomenon was
suggested by Israeli and Li in [IL87], who defined the
class of Binary Comparison Protocols (BCP for short) as
the class of label-based implementations in which the
labels of every two processors can be compared to find the
most recent label among the two. They showed that the
space complexity of any BCP implementation is at least
linear in the number of writers. Later Li and Vitanyi in
[LV90] pointed out that although the original [VA86]
implementation, as well as all bounded implementations, is
BCP, in principle an implementation of an atomic register
does not have to be BCP. To demonstrate this, they
presented a sequential implementation (in sequential
implementations the logical actions are executed
sequentially, without overlapping) with O(log w) space
complexity. Later it was proven by Cori and Sopena in
[CS90] that any implementation for w writers should have
at least 2w - 1 distinct labels. They also devised a
sequential register with exactly 2w - 1 labels, which
improved the space complexity of the sequential [LV90]
implementation by a constant.
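
To make the size of this lower bound concrete, the sketch below converts the count of 2w - 1 distinct labels into a number of bits. The function name `min_label_bits` is hypothetical, and the conversion (bits = ceiling of log2 of the number of labels) is an illustrative assumption rather than a statement from [CS90].

```python
def min_label_bits(w: int) -> int:
    """Bits needed to tell apart the 2w - 1 labels required for w writers."""
    distinct_labels = 2 * w - 1
    # For a positive integer k, (k - 1).bit_length() equals ceil(log2(k)).
    return (distinct_labels - 1).bit_length()


if __name__ == "__main__":
    for w in (1, 4, 5, 1024):
        print(w, "writers ->", min_label_bits(w), "bits")
```

The result grows only logarithmically in w, which is the scale that a space-optimal implementation has to match.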

These works leave an exponential gap between the lower and
upper bounds on the space complexity of atomic register
implementation. While the lower bound relates only to the
combinatorial properties of keeping track of the last
value (and therefore holds for sequential implementations
as well), an actual concurrent implementation should also
deal with concurrency problems; all existing
implementations required linear space to do that. The
significance of this gap is further emphasized when one
takes a closer look at the unbounded implementation of
Vitanyi and Awerbuch in [VA86]. For polynomial length
executions the space complexity of this implementation is
logarithmic. A linear space complexity is reached by this
implementation only in executions of exponential length.
In other words, the bounded protocols supersede the
unbounded protocol only in exponentially long executions.
In polynomial length executions, which are often viewed as
a better model for real-life situations, an exponential
overhead is paid for the theoretical boundedness.

The natural question arising here is: Is this payment
necessary? The answer is negative, as we show by
presenting two bounded, concurrent, label-based
implementations of an atomic, multi-writer, multi-reader
register with logarithmic space complexity. The first
implementation uses atomic (1, n)-physical registers. The
second implementation uses atomic (1, 1)-physical
registers. These implementations are the first to break
the linear space barrier for atomic multi-writer
registers; moreover, by the lower bound proven by Cori and
Sopena in [CS90], the logarithmic space complexity is
optimal. The time complexity of both implementations is
linear, which is obviously optimal. Both implementations
are self-stabilizing: regardless of the system's initial
state, it eventually reaches a legitimate state, that is,
a state which can be reached in a legally initialized
system. In such an arbitrary initial state the processors
may be in arbitrary states (e.g. in the middle of some
logical action) and the physical registers may hold
arbitrary values.

To represent temporal relations among protocol executions
we use precedence graphs. These graphs, which were
introduced by Israeli and Li in [IL87], are time-dependent
graphs whose nodes and edges at any given time are
determined by the labels stored in the processors'
registers at that time. Label $\ell_1$ precedes $\ell_2$
if there is a directed edge $(\ell_2, \ell_1)$ in the
precedence graph. Following [ILV87] we use dynamic
precedence trees in which the outdegree of every node is
at most 1: each label points to at most one other
preceding label. At any given time the precedence graph is
a forest of intrees, trees whose edges are directed
towards the root. For each individual label lb the set of
labels whose temporal relationship with lb can be found by
direct comparison includes the labels whose edges point to
lb, and the single label to which lb's edge points. Hence
our protocols are indeed not BCP. Each path of a
precedence intree is ordered temporally, but labels on
distinct paths are in general not comparable. The paths of
any precedence intree are ordered lexicographically, by
the ids of (the processors which generated) their nodes.
The most preceded path in the precedence forest is called
the frontal branch of the forest. The frontal branch
consists of nodes which are generated by recent write
actions, and the last node on this branch corresponds to
the last serialized write action.

The rest of this paper is organized as follows: In
Section 2 we explain the data structure used by our
protocols and present a sequential implementation which
serves as an exposition for the ideas which are later used
in the concurrent implementations. The (1, n)
implementation is presented in Section 3. The (1, 1)
implementation is briefly sketched in Section 4.
Concluding remarks are given in Section 5.


2  The  Precedence  Trees 

The three implementations presented in this paper are
label-based. Temporal relations among logical actions and
among the labels corresponding to these logical actions
are represented by use of precedence graphs, which were
originally proposed in [IL87]. These are time-dependent
directed graphs whose nodes and edges are encoded by the
labels generated during the processors' actions. The
semantics of the precedence graph is: if there is an edge
from label $\ell_1$ to $\ell_2$ then the action that
generates $\ell_1$ is serialized after the action that
generates $\ell_2$. In this paper we use dynamic
precedence trees which are similar to those used in
[ILV87]. In this section we describe the structure of the
precedence trees used by all the implementations, and
outline a simple sequential implementation of a
multi-writer, multi-reader register using single-writer,
multi-reader registers. The sequential implementation does
not improve upon the complexity of previously known
sequential implementations. It is brought here as an
exposition for the ideas which enable the concurrent
implementations.


2.1  Data  Structure 

The writers of the implemented logical register are
denoted by $W_1, W_2, \ldots, W_w$. Execution number a of
the writer protocol by $W_i$ is denoted by $L_i^a$. Each
individual execution of the writer protocol, $L_i^a$,
generates a single label which is denoted by $\ell_i^a$;
i and a are called the id and the index of $\ell_i^a$
(and of $L_i^a$), respectively. Label $\ell_i^a$ is stored
in the register of $W_i$ at the end of $L_i^a$.

All our protocols share an identical label structure,
which is described below. Each label encodes a node and a
potential edge emanating from that node, where the
outgoing edge is directed either towards the label itself
to form a self loop, or towards nodes encoded by other
labels. For convenience we identify the node of $\ell_i^a$
with the label itself and denote it by $\ell_i^a$ as well.
The node of $\ell_i^a$ is specified by the number i, which
is called the id of the node, and by its address, which is
an integer. Since the id of all nodes of $W_i$ is i, the
id of a node is omitted from its encoding in $\ell_i^a$;
hence the node of $\ell_i^a$ is encoded by the address
field, which is denoted by $\ell_i^a$.address. The edge of
$\ell_i^a$ is encoded by the edge field, which is denoted
by $\ell_i^a$.edge. The edge field stores the id and
address of the node to which the potential edge is
directed, in two subfields which are denoted by
$\ell_i^a$.edge.id and $\ell_i^a$.edge.address,
respectively. The potential edge emanating from $\ell_i^a$
exists in some precedence graph G if for some label
$\ell_j^b$ in G it holds that
$\ell_i^a$.edge = (j, $\ell_j^b$.address). In case this
equality does not hold for any node (label) in G, there is
no edge outgoing from $\ell_i^a$ in G. Our protocols make
sure that in any precedence graph the aforementioned
equality holds for at most one label, hence each label
$\ell_i^a$ encodes at most one outgoing edge, which is
denoted by $e_i^a = (\ell_i^a, \ell_j^b)$, where
$\ell_j^b$ is the node towards which $e_i^a$ is directed.
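
The label layout and the edge-existence test described above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `Label` and `outgoing_edge` are hypothetical, a graph is assumed to be given as a map from writer id to that writer's current label, and id 0 is assumed to stand for the virtual root.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Label(NamedTuple):
    edge: Tuple[int, int]  # (id, address) of the node the potential edge points to
    address: int           # encodes the label's own node; its id is the owner's id


def outgoing_edge(i: int, labels: Dict[int, Label]) -> Optional[int]:
    """Return the id of the node that the label of writer i points to,
    or None when the potential edge does not exist in this graph."""
    target_id, target_address = labels[i].edge
    if target_id == 0:
        return 0  # an edge with id 0 always exists and enters the virtual root
    target = labels.get(target_id)
    if target is not None and target.address == target_address:
        return target_id  # equal to i itself in the case of a self loop
    return None


if __name__ == "__main__":
    graph = {
        1: Label(edge=(0, 0), address=0),  # points to the root
        2: Label(edge=(1, 0), address=3),  # matches the address of label 1
        3: Label(edge=(2, 5), address=1),  # stale: label 2 now has address 3
    }
    print([outgoing_edge(i, graph) for i in (1, 2, 3)])
```

The third label shows how a writer that changes its address silently removes every edge that pointed at its previous label.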

We require that $\ell_i^a$.edge.id $\le$ i, thus the only
type of cycles that we permit are self loops. Consequently
the nodes (labels) on each directed path of the precedence
graph are ordered in increasing order of their ids from
the root to the leaves. The writers' ids lie between 1 and
w. In case $\ell_i^a$.edge.id = 0 we say that $e_i^a$
exists and is directed to the virtual root label
$\ell_0^0$. Since the outdegree of each node is at most 1,
each precedence graph is a forest of labeled intrees,
trees whose edges are directed towards the root. From now
on we assume that the node set of each precedence graph
includes the root label $\ell_0^0$. An edge
$(\ell_j^b, \ell_i^a)$ reflects the fact that $L_j^b$ is
serialized after $L_i^a$. If two edges
$(\ell_j^b, \ell_i^a)$ and $(\ell_k^c, \ell_i^a)$ enter
the same node $\ell_i^a$, then the serialization of
actions $L_j^b$ and $L_k^c$ cannot be determined by (the
precedence relation induced by) the edges of the
precedence tree. In this case we adopt the common
convention and serialize these actions by the ids of their
labels, where higher ids are serialized before lower ids.
Later we define and use the history graph of an execution,
whose nodes are all labels generated during the execution.
In this graph it is necessary to serialize labels with
equal ids whose edges enter the same node. Since the
actions of each individual processor are temporally
ordered by their indices, a label with a lower index is
serialized before a label with a higher index. These
requirements are formally accommodated by the following
lexicographic ordering of labels: let $\ell_i^a$ and
$\ell_j^b$ be two labels; $\ell_i^a$ locally precedes
$\ell_j^b$ if either j < i, or i = j and a < b. Though the
locally precedes relation is defined for any pair of
distinct labels, it reflects a precedence relation only
among labels whose edges enter the same node in the
precedence graph.
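
The locally precedes relation is a two-key comparison, which the following sketch spells out. Each label is represented here only by its (id, index) pair; the function name `locally_precedes` is an assumption made for this illustration.

```python
def locally_precedes(i: int, a: int, j: int, b: int) -> bool:
    """True when the label with id i and index a locally precedes
    the label with id j and index b.

    Higher ids are serialized before lower ids; for equal ids the
    lower index is serialized first.
    """
    return j < i or (i == j and a < b)


if __name__ == "__main__":
    print(locally_precedes(3, 1, 2, 7))  # higher id comes first
    print(locally_precedes(2, 1, 2, 2))  # same id, lower index comes first
```

As the text notes, the answer is only meaningful for two labels whose edges enter the same node.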

Let G be some precedence graph. The frontal branch of G is
a path which is defined as follows: the first node in the
frontal branch is the virtual root label $\ell_0^0$; the
second node of the frontal branch of G is a node whose
edge enters the root and which is locally preceded by all
other nodes in G whose edges enter $\ell_0^0$. In general,
if $\alpha$ is a prefix of the frontal branch of G whose
last node is $\ell_i^a$, then the next node in the frontal
branch is the label whose edge enters $\ell_i^a$ and which
is locally preceded by all other labels whose edges enter
$\ell_i^a$. The last node in the frontal branch of G
(whose indegree is 0) is called the last node of G. Let
$e_1 = (\ell_i^a, \ell_k^c)$ and
$e_2 = (\ell_j^b, \ell_k^c)$ be two edges in some
precedence graph G, such that $e_1$ belongs to the frontal
branch of G (that is, $\ell_j^b$ locally precedes
$\ell_i^a$). In this case we say that $e_1$ excludes $e_2$
from the frontal branch of G. A pictorial description of a
precedence tree appears in Figure 1, where the edges of
the frontal branch appear as solid arrows and the rest of
the edges appear as dotted arrows. The basic idea in all
the implementations is: at any time t the labels stored in
the registers encode a precedence graph whose last node
belongs to the last write action completed.

Figure 1: A precedence intree
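
The definition of the frontal branch translates into a short walk from the root. The sketch below assumes a graph given as a map from writer id to current label, with id 0 standing for the virtual root; `Label` and `frontal_branch` are hypothetical names. Because each writer contributes one current label, "locally preceded by all other labels entering the same node" reduces here to "has the smallest id among them".

```python
from typing import Dict, List, NamedTuple, Tuple


class Label(NamedTuple):
    edge: Tuple[int, int]  # (id, address) of the node the edge points to
    address: int           # address of the label's own node


def frontal_branch(labels: Dict[int, Label]) -> List[int]:
    """Return the ids on the frontal branch, starting at the virtual root 0."""
    branch = [0]
    while True:
        tail = branch[-1]
        # Labels whose edge really enters `tail`; ids grow along a path,
        # which also rules out self loops and guarantees termination.
        children = [
            i for i, lab in labels.items()
            if i > tail and lab.edge[0] == tail
            and (tail == 0 or lab.edge[1] == labels[tail].address)
        ]
        if not children:
            return branch
        branch.append(min(children))  # locally preceded by all its siblings


if __name__ == "__main__":
    graph = {
        1: Label((0, 0), 0),
        2: Label((1, 0), 1),
        3: Label((2, 0), 0),  # stale edge: label 2 now has address 1
        4: Label((1, 0), 2),  # sibling of label 2, excluded from the branch
    }
    print(frontal_branch(graph))
```

In this example the edge of label 2 excludes the edge of label 4, and the last node of the graph is label 2.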

2.2  Sequential  Implementation 

Each writer in the sequential implementation writes in a
(1, n)-register which can be read by all processors,
writers and readers. (Since the implementation is
sequential we need not require that the registers are
atomic.) In this implementation readers do not write. The
current labels at time t are the labels stored in the
processors' registers at time t. The collect operation
consists of reading the labels of all writers and
computing the precedence graph. This precedence graph is
called the current graph; the precedence tree containing
the root label $\ell_0^0$ is called the current tree. The
nodes of the current graph are the root and all the
current labels, that is w + 1 nodes.

Let $\alpha$ be a path in a precedence tree; the i-prefix
of $\alpha$, denoted by $\alpha/i$, is the prefix of
$\alpha$ which contains all nodes whose id is less than i.
The protocol for $W_i$, $1 \le i \le w$, works as follows:
$W_i$ collects the labels of all writers, computes the
current tree and chooses a new label which replaces its
previous active label. The edge of the new label is
directed towards the last node in the i-prefix of the
frontal branch of the current tree. The new edge excludes
the rest of the previous frontal branch from the new
frontal branch. At the same time $W_i$ invalidates all
edges which are directed towards its previous label, by
choosing a new address which is not the address in the
edge field of any writer. The reader protocol is simply to
collect the current tree and to return the value of its
last node. The register of $W_i$, $R_i$, is initialized to
((i, 0), 0). At the initial state all edges of the current
graph are self-loops and the (initial) current tree
contains only the virtual root label $\ell_0^0$. The
initial logical value is the value corresponding to
$\ell_0^0$, which can be chosen freely from the permitted
values of the logical register.
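
The sequential protocol can be simulated in a few lines. The following is a sketch under stated assumptions, not the authors' implementation: actions never overlap, id 0 stands for the virtual root, the class and method names are hypothetical, and the new address is taken as the smallest address not named by another writer's edge that points at the writing processor.

```python
from typing import Any, Dict, List, NamedTuple, Tuple


class Label(NamedTuple):
    edge: Tuple[int, int]  # (id, address) of the node the edge points to
    address: int           # address of the label's own node


class SequentialRegister:
    """Sequential multi-writer register for writers 1..w (no overlapping actions)."""

    def __init__(self, w: int, initial_value: Any = None) -> None:
        self.w = w
        self.root_value = initial_value  # value attached to the virtual root
        # Every register starts as ((i, 0), 0): all edges are self loops.
        self.labels: Dict[int, Label] = {i: Label((i, 0), 0) for i in range(1, w + 1)}
        self.values: Dict[int, Any] = {}

    def _frontal_branch(self) -> List[int]:
        branch = [0]
        while True:
            tail = branch[-1]
            children = [
                i for i, lab in self.labels.items()
                if i > tail and lab.edge[0] == tail
                and (tail == 0 or lab.edge[1] == self.labels[tail].address)
            ]
            if not children:
                return branch
            branch.append(min(children))

    def write(self, i: int, value: Any) -> None:
        prefix = [j for j in self._frontal_branch() if j < i]  # the i-prefix
        last = prefix[-1]
        last_address = 0 if last == 0 else self.labels[last].address
        # Addresses named by other writers' edges that point at writer i;
        # there are at most w - 1 of them, so w addresses always suffice.
        taken = {lab.edge[1] for j, lab in self.labels.items()
                 if j != i and lab.edge[0] == i}
        new_address = min(a for a in range(self.w) if a not in taken)
        self.labels[i] = Label((last, last_address), new_address)
        self.values[i] = value

    def read(self) -> Any:
        last = self._frontal_branch()[-1]
        return self.root_value if last == 0 else self.values[last]


if __name__ == "__main__":
    reg = SequentialRegister(3, "init")
    reg.write(2, "a")
    reg.write(1, "b")
    print(reg.read())
```

Each write makes the writer's label the last node of the frontal branch, so a subsequent read returns the value just written.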

Correctness of the sequential implementation is
straightforward and is left to the reader. In order to
allow a writer to invalidate (at most) w - 1 edges of the
precedence tree, it should have at least w possible
addresses. Therefore for each writer, the size of the edge
field is 2 log w, the size of the address field is log w,
and the size of the entire label is 3 log w.
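
The field sizes quoted above can be tabulated with a short helper. This is only an illustration: the name `label_field_bits` is hypothetical, log w is rounded up to whole bits, and, as in the text, the extra id value used for the virtual root is ignored.

```python
from typing import Dict


def label_field_bits(w: int) -> Dict[str, int]:
    """Field sizes, in bits, of one label in the sequential implementation."""
    log_w = (w - 1).bit_length()  # ceil(log2(w)) for w >= 1
    return {
        "address": log_w,        # one of w possible addresses
        "edge": 2 * log_w,       # an id plus an address
        "label": 3 * log_w,      # edge field plus address field
    }


if __name__ == "__main__":
    print(label_field_bits(8))
    print(label_field_bits(1000))
```

For a thousand writers the whole label fits in a few dozen bits, in contrast to the linear-size labels of the earlier bounded implementations.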

3  Multi-Writer out of Multi-Reader Registers

In this section we present a concurrent implementation of
a (w, r)-atomic register using physical (1, n)-atomic
registers. In this implementation communication is again
one-sided: readers do not write. An important design
decision in this implementation is to serialize all
logical write actions independently of the scheduling of
the logical read actions. Serialization of the logical
read actions is then done in accordance with the
serialization of the write actions. Therefore this section
is divided into two subsections. The first subsection
presents the writer protocol and the serialization scheme
for the logical write actions, and continues with a proof
of the correctness of this serialization scheme. The
second subsection presents the reader protocol and the
serialization scheme for the logical read actions, and
then proceeds to prove the correctness of the
serialization scheme for the read actions, which implies
the correctness of the entire implementation.


3.1  The  Writer  Protocol 

3.1.1  Description 

The writer protocol is obtained by adjusting the
sequential writer protocol to the concurrent environment
while keeping its basic ideas and data structure. The
structure of the labels in this implementation is
identical to the structure of labels in the sequential
implementation, and once more they encode a labeled intree
as a precedence graph whose last node corresponds to the
last completed write action. In a concurrent environment,
however, a writer cannot simply replace its current label
by a new label which becomes the last in the current tree.
If the protocol were to work in this fashion, a new label
might be pointed at, immediately when it is written, by a
label of another write action which is completed at an
earlier time. In this way the precedence graph may not
reflect the temporal order of actions. This problem does
not arise in the sequential implementation, which supports
only non-overlapping logical actions. To overcome this
problem the coordination field of a writer's register in
this implementation consists of two labels called new and
current. During $L_i^a$, the new field stores a tentative
choice for the next label of $W_i$; the final label is
written in the current field in the last physical action
of $L_i^a$. The details of this mechanism are explained
below. Accordingly, the structure of $R_i$, the register
of $W_i$, is $R_i$ = (value, new, current), and the space
complexity is equal to the size of two labels rather than
one. Since the value field is not used by the protocol we
assume from now on that each label is written with the
corresponding value and omit it from the protocol's
description.

Similar to the sequential implementation, the register of
$W_i$, $R_i$, is initialized to ((i,0),0,(i,0),0). That
is, at the initial state all edges of the current graph
are self loops, the (initial) current tree contains only
the root label $\ell_0^0$ and the initial logical value is
the value corresponding to $\ell_0^0$, which can be chosen
freely from the set of all permitted values. Let
$\ell_i^a$ and $\ell_j^b$ be two labels; the fact that
$\ell_i^a$ and $\ell_j^b$ are actually the same label,
i.e. i = j and a = b, is denoted by
$\ell_i^a \equiv \ell_j^b$. Since the relation $\equiv$ is
not computable by the processors, the protocol uses the
relation $\ell_j^a \simeq \ell_j^b$, which denotes the
fact that $\ell_j^a$.edge = $\ell_j^b$.edge and
$\ell_j^a$.address = $\ell_j^b$.address.

begin
    Collect $G_i^a$
    last := id of the last node of $B_i^a/i$
    new.address := select($G_i^a$)
    new.edge := (last, $\ell_{last}$.address)
    new := (new.edge, new.address)
    $R_i$ := write(new, current)
    $\ell'_{last}$ := read $R_{last}$.current
    if $\ell'_{last} \simeq \ell_{last}$
    then
        /* connect: Direct $e_i^a$ towards $\ell_{last}$ */
        current := new
    else
        /* loop: Direct $e_i^a$ towards $\ell_i^a$ */
        current.edge := (i, new.address)
        current.address := new.address
    endif
    $R_i$ := write(new, current)
end

Figure 2: The protocol for $W_i$

The writer protocol appears in Figure 2. We now describe
the protocol assuming that $L_i^a$ is executed. First
$W_i$ executes the procedure collect, in which it reads
the registers of all other writers and computes a graph,
denoted by $G_i^a$. The labels of $G_i^a$ are those read
from the current fields of all registers. Execution of
collect takes w atomic physical actions which are denoted
by $r_i^a[1], \ldots, r_i^a[w]$. During the collection of
$G_i^a$ some writers may change their labels; therefore
$G_i^a$ is not necessarily equal to the current graph at
any time during the collection of $G_i^a$. The frontal
branch of $G_i^a$ is denoted by $B_i^a$. The node with the
maximal id in $B_i^a/i$ is the last node of $B_i^a/i$. Let
$\ell_{last}$ be the last node of $B_i^a/i$. After $G_i^a$
is collected, $W_i$ chooses a tentative new label whose
edge is directed towards $\ell_{last}$ and whose address
is obtained by the function select; this function ensures
that the new label is not the head of any (existing or
potential) edge encoded by any current label or new label
read during $L_i^a$. In this way all edges entering
$\ell_i^{a-1}$ are invalidated. In addition the function
select ensures that new.address $\ne$
$\ell_i^{a-1}$.address even if no other edge is directed
towards $\ell_i^{a-1}$; thus a writer never uses the same
address twice in a row. The chosen label is declared by
$W_i$ by writing it into the new field of $R_i$ while the
current label is not changed. The declaring write action
is denoted by $d_i^a$. Following $d_i^a$, $W_i$ rereads
the register of $W_{last}$, the writer whose label is last
in $B_i^a/i$. This second read action is denoted by
$\bar{r}_i^a[last]$. The label read in action
$\bar{r}_i^a[last]$ is called the target label of $L_i^a$.
The logical write action $L_i^a$ is concluded by a final
physical write action which is denoted by $w_i^a$. In this
action $W_i$ replaces its current label as follows: let
$\ell_{last}$ and $\ell'_{last}$ be the two labels read by
$W_i$ in actions $r_i^a[last]$ and $\bar{r}_i^a[last]$
respectively. If $\ell_{last} \simeq \ell'_{last}$ then
the new label is assigned to current; in this case we say
that $L_i^a$ connects. If however
$\ell_{last} \not\simeq \ell'_{last}$ then current is
chosen such that its edge is a self-loop directed towards
$\ell_i^a$ itself. In this case we say that $L_i^a$ loops.

3.1.2  Serialization  Scheme  for  Logical 
Write  Actions 

The  atomicity  of  an  implementation  is  proved  by 
showing  that  for  every  physical  sequential  execu¬ 
tion,  the  induced  logical  execution  can  be  seri¬ 
alized.  A  sequential  execution  is  entirely  deter¬ 
mined  by  its  schedule;  the  complete  asynchrony 
of  the  system  means  that  the  only  points  in  time 
which  can  be  used  to  serialize  a  logical  action, 
are  the  occurrence  times  of  the  physical  actions, 
and  whenever  some  flexibility  is  possible,  the  in¬ 
tervals  between  these  occurrence  times.  For  this 
reason  we  sometimes  use  the  name  of  an  ac¬ 
tion  to  denote  its  occurrence  time.  Whenever 
we  say  that  some  property  p  holds  at  action  a 
we  mean  that  p  holds  right  after  the  occurrence 
time of a. In the following subsections we fix
some arbitrary sequential physical execution with
respect to which all definitions and proofs are
made. Since this fixed execution is arbitrary,
the results hold for every system execution of the
implementation. The logical write actions are
serialized by an explicit assignment of a serialization
time to each action. This assignment is done using
a History graph, a precedence graph whose
structure reflects the execution of the system.
The history graph, which is not computable
by the processors, plays a key role in the
correctness proof of our protocols.

Definition 1: Let $E$ be an execution of the
system. The History graph of $E$, $H$, is a
time-dependent graph which is constructed
incrementally during the execution, as follows:

1. $H^0$ is the History graph at time 0 (before the
execution begins). It contains only the root
node.

2. Let $L_i^a$ be an arbitrary logical write action.
Let $t$ be the time when $w_i^a$ occurs; the time
before $t$ and after any other physical action
which precedes $w_i^a$ is denoted by $t^-$. $H^t$ is
the graph obtained from $H^{t^-}$ by adding the
node $\ell_i^a$ and the directed edge $e_i^a$ as follows:
If $L_i^a$ connects then $e_i^a$ is directed towards the
target label of $L_i^a$, otherwise ($L_i^a$ loops) $e_i^a$ is
a self-loop. According to our convention of
using an action's name to denote its occurrence
time, $H^t$ is also denoted by $H^{w_i^a}$.

At any time $t$, the history graph $H^t$ is a
precedence forest which consists of a single precedence
tree and some disjoint self-loops. The frontal
branch of $H^t$ is denoted by $B_H^t$. In order to
serialize the logical actions we first partition the set
of logical write actions into two subsets, good and
bad, where a good write action is a write action
whose label is last in $B_H^t$ at the time $t$ that it
joins the history graph.

Definition 2: Let $L_i^a$ be an arbitrary logical
write action. $L_i^a$ is good if $\ell_i^a$ is last in
$B_H^{w_i^a}$. A logical write action which is not good is
bad.

Obviously if $L_i^a$ loops then it is bad, but there
are many cases in which $L_i^a$ connects and it is also
bad. Using the partition of logical write actions
into good and bad we assign a serialization time to
every logical write action. The serialization time
establishes a total order on the set of logical write
actions:

Definition 3: Let $t$ be the occurrence time of
$w_i^a$. Define the serialization time of $L_i^a$ as follows:

1. If $L_i^a$ is good, that is, if $\ell_i^a$ is the last node of
$B_H^t$, then $L_i^a$ is serialized at $t$.

2. If $L_i^a$ is bad and $\ell_k^c$ is the last node of $B_H^t$
then $L_i^a$ is serialized before $t$ and after any
other physical action which precedes $w_i^a$. In
this case we say that $L_i^a$ is serialized by $L_k^c$.
In case two bad write actions, $L_i^a$ and $L_j^b$,
are serialized by the same good write action
$L_k^c$, $L_i^a$ and $L_j^b$ are serialized by their ids: the
action with the lower id first, and the action
with the higher id second.

Under the defined serialization time, at the
occurrence time $t$ of every physical action, the last
node in $B_H^t$ is the most recent write action, that
is, the value of the logical register.

3.1.3  Correctness  of  the  Serialization 
Scheme  for  Logical  Write  Actions 

In  the  correctness  proof  we  have  to  prove  that 
the  serialization  time  satisfies  the  serialization  re¬ 
quirements  for  atomic  registers.  In  Theorem  9 
we  prove  that  every  logical  write  action  is  serial¬ 
ized  within  its  execution  interval.  In  Theorem  11 
we  prove  that  all  graphs  collected  by  the  writers 
are  precedence  graphs.  We  start  the  correctness 
proof  with  some  technical  lemmas. 

Lemma 1: Let $\ell_i^a$ be a node in $H^t$. If $\ell_i^a \notin B_H^t$,
then for any $t' > t$, $\ell_i^a \notin B_H^{t'}$.

Let $a$ and $b$ be two physical atomic actions; the
fact that $a$ occurs before $b$ is denoted by $a \to b$.

Lemma 2: If $(\ell_i^a, \ell_j^b)$, $i \neq j$, is an edge in $H$
then $w_j^b$ occurs before $w_i^a$ (i.e. $w_j^b \to w_i^a$).

Lemma 3: If $\ell_i^a \in B_H^t$ then $L_i^a$ is good.

Lemma  4: 

(a)  The  sequence  of  node  ids  along  any  path  of 
the  history  graph  from  the  leaf  towards  the 
root  is  strictly  decreasing. 

(b)  Every  path  of  the  history  graph  contains  at 
most  one  label  of  every  writer. 

Due  to  concurrency  it  is  not  guaranteed  that 
the  edges  of  a  graph  collected  by  any  writer  or 
reader  belong  to  the  history  graph.  A  weaker 
relationship  between  the  edges  of  the  collected 
graphs  and  the  edges  of  the  history  graph  is  de¬ 
picted  in  the  next  lemma: 

Lemma 5: If $(\ell_i^a, \ell_j^b)$ is an edge in a collected
graph $G$ then there exists an integer $r$, $r \geq 0$,
such that $(\ell_i^a, \ell_j^{b+r})$ is an edge in $H$.

The next lemma proves that if $\ell_i^a$ belongs to
the frontal branch of the history graph then it is
the current label of $W_i$.

Lemma 6: If $\ell_i^a \in B_H^t$ for some time $t$, then
$\ell_i^a$ is the current label of $W_i$ at $t$.

The  next  corollary  forms  a  tighter  relationship 
between  edges  of  G  whose  head  belongs  to  the 
frontal  branch  of  the  history  graph  and  the  cor¬ 
responding  edges  of  H. 

Corollary  7:  Let  be  an  edge  in  G^.  If 

i'j  e  B'^^^  then  is  an  edge  in 

The  fact  that  every  label  e  B\j  is  in  5“ 
and  every  label  t\  6  5“  is  in  B\j,  is  denoted  by 
B\j  =  fif.  In  the  next  lemma  and  in  the  theo¬ 
rem  that  follows,  we  prove  that  the  serialization 
time  of  each  logical  write  action  lies  within  its 
execution  interval. 

Lemma  8;  Let  to  be  the  occurrence  time  of 
rfL/].  If  at  t  >  to  B^fjl{j  +  1)  ^  BtHi  ■¥  1), 
then  there  exists  a  good  logical  action  for 
some  k  <  j,  such  that  occurs  within  the  time 
interval  starting  at  r“[A:]  and  ending  at  t. 


Theorem  9:  Every  logical  write  action  is  seri¬ 
alized  within  its  execution  interval. 

The fact that $L_i^a$ is serialized before $L_j^b$ is
denoted by $L_i^a \Rightarrow L_j^b$. The next lemma proves that
the history graph is a precedence graph with
respect to the relation $\Rightarrow$. Since the history graph
is not computable by the processors, the
significance of this lemma is reflected in the theorem
that follows, in which it is proved that the graphs
collected by the processors are also precedence
graphs with respect to the relation $\Rightarrow$.

Lemma 10: If $(\ell_i^a, \ell_j^b)$, $i > j$, is an edge in $H$
then $L_j^b \Rightarrow L_i^a$.

Theorem 11: If $(\ell_i^a, \ell_j^b)$, $i > j$, is an edge in a
collected graph $G$, then $L_j^b \Rightarrow L_i^a$.

3.2  The  Reader  Protocol 
3.2.1  Description 

Like  the  writer  protocol,  the  reader  protocol  is 
obtained  by  adjusting  the  sequential  reader  pro¬ 
tocol  to  the  concurrent  environment.  Though  the 
basic  idea  in  this  implementation  is  to  keep  a 
precedence  tree  whose  last  node  is  the  last  value 
written  to  the  logical  register,  it  is  not  possible 
to  just  read  the  current  tree  and  return  its  last 
node:  Due  to  concurrency,  the  current  tree  and 
its  last  node  may  change  during  the  execution  of 
the  collect  procedure  by  the  reader.  In  particular 
a  label  of  an  action  which  should  not  be  returned 
by  a  reader,  may  appear  as  last  in  its  collected 
tree.  This  happens  when  some  concurrent  write 
actions  cause  the  reader  to  see  some  branches  of 
the  tree  as  “hanging  in  the  air”.  Therefore  a 
single  collection  is  not  sufficient.  Our  protocol 
collects  three  forests  and  analyzes  the  differences 
among  these  forests  to  determine  the  returned  la¬ 
bel.  The  three  forests  are  denoted  by  G,  G,  and 
G.  Forest  G  is  collected  in  reverse  order  —  from 
$R_w$ down to $R_1$, therefore most of the lemmas
proved in the previous section do not hold for G.
The analysis of the three graphs does not yield
an accurate description of the current graph, but
rather  enables  the  reader  to  identify  a  label  which 
satisfies  the  requirements  for  the  reader  protocol 
as  follows:  The  identified  label  is  either  last  in  the 
history  graph  when  the  logical  read  action  starts, 
and  hence  it  is  the  last  logical  write  action  that 
is  serialized  before  the  read  action  starts,  or  the 
identified  label  is  generated  by  a  logical  write  ac¬ 
tion  serialized  within  the  execution  interval  of  the 
logical  read  action.  In  this  case  the  logical  read 
is  serialized  immediately  after  the  logical  write. 
The  physical  actions,  executed  by  TJu,  during  the 
reader  protocol  are  denoted  by:  r„[l] ...  in 
which  G  is  collected,  [w]  ...  [1],  in  which 

G  is  collected,  and  f^[l]  ...  in  which  G  is 

collected.  The  notation  Bf  C  Bj  is  used  when 
for  each  €  5“,  there  is  a  label  if  €  Bj  such 
that  ^  if  ■  The  notation  Bf  Bj  is  used  when 
Bf  C  Bj  and  Bj  C  Bf.  The  code  of  the  reader 
protocol  appears  in  Figure  3. 


begin 

Collect  G^;  Collect  Collect  Gu 
last  :=  id  of  the  last  node  of 
if  ~  then 

i  :=  minj{(f5(€  G„)  jk  ij(€Gu))  and 

(e^j  ^  G„))} 

return  if 

elseif  (B^  c^Bu—  B^)  then 
return  the  label  of  last  in  B^ 
elseif  (5„  9^  Bu)  and  (5„  C  G^)  then 
return  the  label  of  last  in  B^ 
elseif  (By  ^  Bu)  and  {B^  <(.  Gu)  then 
z  :=  min^(f,ne  BJ  9^  if(e  G„)) 

return  if 

endif 

end 


Figure  3:  The  protocol  for  TZu 


3.2.2  Serialization  Scheme  for  Logical 
Read  Actions 

Throughout this paper $S_u^a$ denotes the $a$-th
execution of the read protocol by $\mathcal{R}_u$. The
serialization time of any logical read action is determined
by the serialization time of the logical write action
whose value is returned by the read action,
according to the following proposition:

Proposition 12: If for any logical read action
$S_u^a$ the returned label $\ell_j^b$ satisfies one of the
following conditions:

1. $L_j^b$ is the logical write action that is serialized
last before $S_u^a$.

2. $L_j^b$ is serialized within the execution interval
of $S_u^a$.

then the implementation is atomic.

To prove the proposition correct we have to
show that if for every action in some execution
$E$ one of these conditions holds, then $E$ is
serializable. This is proven by the following
serialization scheme:

Definition 4: Let $\ell_i^b$ be the label returned by
$S_u^a$. Denote by $t_s$ and $t_e$ the occurrence times
of the first physical action, $r_u^a[1]$, and of the last
physical action of $S_u^a$, respectively. The serialization
time of $S_u^a$ is defined as follows:

1. If $L_i^b$ is not serialized within the execution
interval of $S_u^a$ then $S_u^a$ is serialized at $t_s$.

2. If $L_i^b$ is serialized at time $t$ which lies within
the execution interval of $S_u^a$ then $S_u^a$ is
serialized at $t^+$, where $t^+$ denotes the time
immediately after $t$.
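
The two cases of Definition 4 can be illustrated with a minimal sketch. The function name, the encoding of "immediately after t" as a (time, tiebreak) pair, and the treatment of the execution interval as open are assumptions made here for illustration only; they are not part of the protocol.

```python
def read_serialization_point(t_s, t_e, write_time):
    """Serialization point of a logical read, following Definition 4.

    t_s, t_e   -- start and end of the read's execution interval
    write_time -- serialization time of the write whose label is returned

    Returns a (time, tiebreak) pair. Tiebreak 1 means "immediately after",
    so sorting such pairs places the read right after a write at (time, 0).
    Assumption: the execution interval is treated as open.
    """
    if t_s < write_time < t_e:
        # Case 2: the write is serialized inside the read's interval.
        return (write_time, 1)
    # Case 1: the read is serialized at the start of its interval.
    return (t_s, 0)


# A write serialized at time 7 inside the interval (5, 9).
inside = read_serialization_point(5, 9, 7)
# A write serialized at time 3, before the read starts.
before = read_serialization_point(5, 9, 3)
```

Encoding the point as a pair keeps the ordering explicit: a read serialized "just after" a write compares greater than that write and smaller than any later action.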

3.2.3  Correctness  of  the  Implementation 

The  correctness  of  the  serialization  scheme  for 
logical  read  actions,  and  the  correctness  of  the 
entire  implementation,  is  based  on  Proposition  12 
and on the following theorem:

Theorem 13: Let $(t_s, t_e)$ be the execution
interval of $S_u^a$, let $\ell$ be the label returned by $S_u^a$ and
let $L$ be the logical write that produced the label
$\ell$. $L$ satisfies one of the following two claims:

1. $L$ is the last logical write action serialized
before $t_s$ (and hence $\ell$ is last in $B_H^{t_s}$), or

2. $L$ is serialized within the interval $(t_s, t_e)$.

4  Multi-Writer out of Single-Reader Registers

In this section we briefly sketch an implementation
in which the physical registers are atomic
(1,1)-registers. Though this implementation follows
the lines of the previous implementation, the
use of inferior physical registers requires some
adjustments. First, every single physical action in
the previous implementation is now replaced by
$n$ physical actions (for example $w_i^a$ is replaced
by $w_i^a[1] \ldots w_i^a[n]$). Second, in this implementation
communication is two-sided: every pair of
processors, $P_i$ and $P_j$ (regardless of whether they
are writers or readers), communicate via a pair of
atomic (1,1)-registers. The writer protocol follows
the lines of the protocol in the $(1,n)$
implementation with a few minor changes. In the
reader protocol we take advantage of the use of
(1,1) registers to decide upon the returned value
after collecting a single graph. In an implementation
based on (1,1) registers, a reader can identify
executions of logical actions whose execution
interval is enclosed within the execution interval of
the reader, without enlarging the label size over
$O(\log n)$ bits. This is done by using a well-known
handshake mechanism. The main idea behind
the reader protocol is as follows: If the reader
does not see any enclosed label during the
collection of its graph $G$, then the frontal branch of
the history graph before the collection starts is a
subgraph of $G$. Therefore if there exists an
enclosed label then the reader returns it; otherwise
the reader returns the last label of $G$.

The serialization time of all write actions is
determined once more using the history graph. The
nodes of the history graph are labels of writers
and (unlike the previous implementation) of readers.
The outgoing edge from $\ell_i^a$ in $H$, $e_i^a$, is
determined in the same way as in the previous
implementation. The time $\ell_i^a$ joins the history
graph always falls between $w_i^a[1]$ and $w_i^a[n]$: Let
$t$ be the first time that some label joins the history
graph with an outgoing edge directed towards
$\ell_i^a$. If $w_i^a[n]$ occurs after $t$ then $\ell_i^a$ joins $H$ at $t^-$;
otherwise $\ell_i^a$ joins $H$ at the occurrence time of
$w_i^a[n]$. Let $L_i^a$ be a logical write action and let $t$
be the time $\ell_i^a$ joins the history graph. If $L_i^a$ is
good (that is, last in $B_H^t$) it is serialized at $t$. If
$L_i^a$ is bad it is serialized by the label that was last
in $B_H^t/(w+1)$.

The serialization time of a logical read action
$S_u^a$ is determined by the write action, $L_i^b$, whose
value is returned by $S_u^a$. It can be shown that $L_i^b$
is either serialized within the execution interval
of $S_u^a$, or that it is the last write action serialized
before $S_u^a$ begins. In the first case $S_u^a$ is serialized
just after $L_i^b$. In the second case $S_u^a$ is serialized
at the beginning of its execution interval.


5  Concluding  Remarks 

We  have  presented  two  implementations  of  a 
multi-reader,  multi-writer,  atomic  register.  Both 
implementations  use  a  novel  method  of  dynamic 
precedence  trees  in  which  only  partial  precedence 
information  is  represented  and  therefore  they  are 
not  BCP.  Both  implementations  are  optimal  with 
respect  to  the  most  important  complexity  cri¬ 
teria:  They  have  logarithmic  space  complexity 
and linear time complexity. Communication in
the multi-reader registers based implementation
is one-sided: only writers execute physical write
actions. In the single-writer register based
implementation communication is two-sided.
Recently it was proved by Israeli, Tromp and
Vitanyi in [ITV92] that there exists no such
implementation with one-sided communication. The
existence  of  an  implementation  which  is  based 
on  (1,1)  atomic  registers  with  label-size  which 
depends  only  on  the  number  of  writers  remains 
open.


Tolerating Linear Number of Faults in Networks of Bounded Degree

Eli  Upfal  * 


Abstract 

In [7], Dwork et al. proposed a new
paradigm for fault tolerant distributed computing
termed almost everywhere agreement.
While all other fault tolerance paradigms require
networks of high connectivity to tolerate
a substantial number of faults, it was
shown in [7] that the new paradigm can be
achieved even on bounded degree networks,
as long as the number of faults is bounded
by $O(n/\log n)$, where $n$ is the size of the network.

A major problem that was left open in [7]
is whether almost everywhere agreement can
be achieved on bounded degree networks in
the presence of up to $O(n)$ faulty nodes
(processors). In this work we answer this question
in the affirmative. As in [7], our solution
is based on a general technique for simulating
on a bounded degree network an algorithm
designed for the complete network.
Each communication round of the complete
network protocol is simulated by a logarithmic
number of communication rounds, and
with a polynomial number of messages.

1  Introduction 

Achieving processor cooperation in the presence
of faults is a major problem in distributed
systems. Consider a network in
which each node is a processor and each edge
is a communication link. Popular paradigms
such as Byzantine agreement require $\Omega(t)$
connectivity in the communication network in
order to tolerate $t$ faults [5, 9]. A simple corollary
of this result is that a system can reach
agreement in the presence of $t$ faulty nodes
only if every processor is directly connected
to at least $\Omega(t)$ others. Such high connectivity,
while feasible in a small system, cannot
be implemented at reasonable cost in a large
system.

As  technology  improves,  increasingly  large 
distributed  systems  and  parallel  computers 
will  be  constructed.  In  any  forthcoming  tech¬ 
nology,  the  number  of  faulty  processors  in  a 
given  system  will  grow  with  the  size  of  the 
system,  while  the  degree  of  the  interconnec¬ 
tion  network  will,  for  all  practical  purposes, 
remain  fixed. 

Despite these negative observations, distributed
systems are widely used and parallel
computers are being built. This suggests
that the correctness conditions for Byzantine
agreement are too stringent to reflect practical
situations. In particular, Byzantine agreement
guarantees coordination among all correct
processors, omitting only the $t$ faulty processors.
In many situations it may suffice to
guarantee agreement among all but $O(t)$ processors.
In other situations a simple majority
consensus may suffice. Similarly, in clock
synchronization, or in firing squad synchronization,
it may suffice for a vast majority of
the correct processors to be synchronized.

Motivated by the need to run fault tolerant
computing on sparse networks, Dwork et
al. [7] introduced the notion of almost everywhere
agreement (denoted a.e.-agreement), in
which all but a small number of the correct
processors must choose a common decision
value. Dwork et al. showed that by relaxing
the correctness condition from agreement between
all the non-faulty processors to agreement
between almost all the non-faulty processors,
one can eliminate the costly connectivity
requirement. In particular they show
that there are bounded degree networks of
$n + O(t)$ processors that guarantee agreement
among $n$ correct processors in the presence
of up to $t$ faults. Further works by Berman
and Garay [3, 4] improve the efficiency of the
protocols for achieving the distributed agreement.
The a.e.-agreement paradigm admits
deterministic solutions in networks of small
constant degree to such fundamental problems
as atomic broadcast, Byzantine agreement,
and clock synchronization.

More precisely, a protocol $P$ is said to
achieve $f(t)$-agreement if in every execution
of $P$ in which at most $t$ processors fail, at
least $n - f(t)$ non-faulty processors eventually
decide on a common value. Moreover, if all
the correct processors share the same initial
value, then that must be the value chosen.
Note that the traditional Byzantine agreement
problem is just $t$-agreement.

A protocol $P$ achieves a.e.-agreement in the
presence of $t$ faulty nodes if it achieves
$f(t)$-agreement for some $f(t) < \mu t$, where $\mu$ is a
constant that is independent of $t$ and of the
size of the network.
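
The definition of f(t)-agreement can be read as a predicate on the outcome of a single execution. The sketch below is only an illustration under stated assumptions: the function name and the dictionary encoding of decisions and initial values are hypothetical, and checking one outcome does not establish the property for every execution of a protocol.

```python
from collections import Counter


def outcome_satisfies_f_agreement(decisions, initial, n, f_t):
    """Check one execution outcome against the f(t)-agreement conditions.

    decisions -- dict: non-faulty processor -> decided value (None if undecided)
    initial   -- dict: non-faulty processor -> initial value
    n         -- the quantity n used in the bound n - f(t)
    f_t       -- the value of f(t) for this execution
    """
    counts = Counter(v for v in decisions.values() if v is not None)
    # Values on which at least n - f(t) non-faulty processors decided.
    qualifying = [v for v, c in counts.items() if c >= n - f_t]
    initial_values = set(initial.values())
    if len(initial_values) == 1:
        # Validity: a unanimous initial value must be the value chosen.
        qualifying = [v for v in qualifying if v in initial_values]
    return bool(qualifying)


# Five processors, one faulty (absent from the dictionaries), f(t) = 1.
ok = outcome_satisfies_f_agreement(
    {0: "a", 1: "a", 2: "a", 3: "a"},
    {0: "a", 1: "b", 2: "a", 3: "a"},
    n=5, f_t=1,
)
```

With f(t) = t the predicate requires every non-faulty processor to agree, which matches the remark that Byzantine agreement is t-agreement.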

The main result in [7] is a bounded degree
network and a communication protocol
for that network that achieves a.e.-agreement
in the presence of up to $O(n/\log n)$ faults.
The major problem left open in [7] is whether
a.e.-agreement can be achieved on bounded
degree networks in the presence of more than
$n/\log n$ faults. Note that this problem is
significantly harder. In bounded degree networks
the distance between most pairs of
nodes is $\Omega(\log n)$. Thus, in the presence of
more than $n/\log n$ faulty nodes, most
communication paths between most pairs of
processors include at least one faulty node.

In this work we show that the above
difficulty can be overcome and that there
are bounded degree networks for which
a.e.-agreement is achievable in the presence of
up to $O(n)$ faulty processors. Our solution
is based on special resilient properties of
expander graphs. To simplify the presentation
no attempt is made here to compute the best
constants. Our goal is to demonstrate the
rather surprising fact that a.e.-agreement in
the presence of up to a linear number of faults
is feasible on some bounded degree networks.

We give an explicit construction of an
$n$-node bounded degree network $G$, and a
communication protocol between pairs of nodes
in $G$. We show that for any set $T$ of faulty
nodes, $|T| < \alpha n$, the communication protocol
guarantees reliable communication between
all pairs of processors in a set $P(T)$
of at least $n - \mu t$ non-faulty processors in
$G$ ($\alpha$ and $\mu$ are constants independent of $n$
and $t$). The communication protocol requires
$O(\log n)$ communication rounds and a polynomial
number of messages. Given any protocol
designed for a complete network, it can
be simulated by our communication protocol
on the bounded degree network $G$, achieving
agreement between a set of at least $n - \mu t$
non-faulty nodes in the presence of $t$ faulty
nodes.

Our  model  of  computation  is  identical  to 
that  commonly  used  in  the  Byzantine  liter¬ 
ature.  Specifically,  each  processor  can  be 
thought  of  as  a  (possibly  infinite)  state  ma¬ 
chine  with  specific  registers  for  communica¬ 
tion  with  the  outside  world.  The  proces¬ 
sors  communicate  by  means  of  point-to-point 
links,  which  are  assumed  to  be  completely  re¬ 
liable.  The  entire  system  is  synchronous,  and 
can  be  thought  of  as  controlled  by  a  com¬ 
mon  clock.  At  each  pulse  of  the  common 
clock  a  processor  may  send  a  message  on  each 
of  its  incident  communication  links  (possibly 
different  messages  on  different  links).  Mes¬ 
sages  sent  at  one  clock  pulse  are  delivered  be¬ 
fore  the  next  pulse.  Note  that  this  model 
counts  communication  rounds,  and  ignores 
the  local  computation  in  the  nodes.  An  in¬ 
triguing  open  problem  is  to  construct  an  effi¬ 
cient  algorithm  for  the  computation  that  each 
node performs in the process of achieving
a.e.-agreement. (The local computation required
by  our  solution  is  super-polynomial  in  n.) 

Byzantine  agreement  on  bounded  degree 
networks  in  the  presence  of  random  faults  has 
been  studied  by  M.  Ben-Or  and  D.  Goldriech 
[2,  8].  They  presented  a  network  and  a  poly¬ 
nomial  time  agreement  protocol  for  that  net¬ 
work that achieves a.e.-agreement with high
probability in the presence of a linear number of
random  faults  in  the  model  where  processors 
fail  independently  with  some  fixed  constant 
probability. 

2  Combinatorial Characterization of Resilient Networks

Dwork et al. [7] derived the following combinatorial
characterization of networks that
admit $f(t)$-agreement, for any function $f(t)$.
We present the characterization in this section,
and prove in the next section that there
are bounded degree networks that satisfy it
for $t$ up to linear in the size of the network,
and $f(t)$ linear in $t$.

For  any  agreement  protocol  P,  let  P{T)  be 
any  maximal  set  of  correct  processors  that 
always  reach  agreement  under  the  protocol  P, 
independent  of  the  behavior  of  the  processors 
in  T  (thought  of  as  faulty). 

Theorem 1 [7] Let $G$ be a communication
graph, let $\{T_i\}_{i=1}^{k}$ be the family of all possible
sets of faulty processors in $G$, and let
$\{A(T_i)\}_{i=1}^{k}$ be a family of sets of processors
in $G$. There exists a protocol $P$ s.t. $P(T_i) = A(T_i)$
for $i = 1, \ldots, k$, if and only if for every
pair of processors $u, v \in A(T_i) \cap A(T_j)$, the
set $T_i \cup T_j$ does not disconnect $u$ from $v$ in $G$.

Sketch of the proof: To prove necessity, we
show that if there exist sets $T_i$ and $T_j$ which
jointly (but not individually) can disconnect
correct processors $u$ and $v$, then there exist
two scenarios, indistinguishable to $v$, such
that in one scenario $T_i$ is faulty and $u$ decides
on a value $a$, while in the second scenario $T_j$
is faulty and $u$ decides on a value $b$.

We prove sufficiency by constructing a reliable
communication protocol between pairs
of processors in $A(T)$. We briefly describe a
few points of our construction.

A processor $u$ transmits a message to $v$ by
sending it along all simple paths from $u$ to $v$.
As the message passes from site to site, each
processor appends the name of the processor
from which the message was received. Thus,
a message that passes through faulty processors
contains the name of at least one such
processor (the last one). Processor $v$ searches
for a set $T_i$ such that all the messages not
passing through this set are consistent and
both $u$ and $v$ are in $A(T_i)$. Let $T$ be the
set of faulty processors in a particular execution
of our algorithm. If $u$ and $v$ are in $A(T)$
then $v$ will try this set and extract the correct
value. Crucial to our algorithm is that $v$
will never extract an incorrect value. This is
because by assumption, for all other relevant
sets $T_j$, $T \cup T_j$ will not disconnect $u$ from $v$.
Thus, $v$ receives the message via at least one
fault-free path. Therefore, the faulty processors
can at most create an inconsistent set of
values, from which $v$ extracts nothing. □
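
The combinatorial condition of Theorem 1 can be checked mechanically on small graphs. The following is a minimal sketch, assuming the graph is given as a dictionary of neighbor sets and the two families as parallel lists; the function names are hypothetical and the brute-force search is meant only to make the condition concrete.

```python
from collections import deque
from itertools import combinations, combinations_with_replacement


def connected_avoiding(adj, u, v, removed):
    """Breadth-first search for a path from u to v that avoids `removed`."""
    if u in removed or v in removed:
        return False
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in adj[x]:
            if y not in removed and y not in seen:
                seen.add(y)
                queue.append(y)
    return False


def satisfies_condition(adj, fault_sets, agree_sets):
    """True if, for every i and j, no pair u, v in A(T_i) ∩ A(T_j)
    is disconnected by T_i ∪ T_j."""
    for i, j in combinations_with_replacement(range(len(fault_sets)), 2):
        removed = fault_sets[i] | fault_sets[j]
        common = agree_sets[i] & agree_sets[j]
        for u, v in combinations(list(common), 2):
            if not connected_avoiding(adj, u, v, removed):
                return False
    return True


# A 4-cycle 0-1-2-3-0: {1} and {3} separate 0 from 2 only jointly.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
jointly_cut = satisfies_condition(cycle, [{1}, {3}], [{0, 2, 3}, {0, 1, 2}])
```

In the 4-cycle example each fault set alone leaves 0 and 2 connected, but their union does not, which is the situation used in the necessity argument.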

3  Proof of the Main Result

Theorem 2 There exist

1. Constants $\alpha > 0$, $\mu$, and $d$, independent
of $t$ and $n$;

2. An $n$-vertex $d$-regular network $G$, which
can be explicitly constructed;

3. A communication protocol $P$;

such that for any set of faulty nodes $T$ in $G$,
the communication protocol guarantees reliable
communication between all pairs of nodes
in a set of non-faulty nodes $P(T)$, such that
$|P(T)| > n - \mu t$. Furthermore, the protocol
requires $O(\log n)$ communication rounds, and
generates a polynomial (in $n$) number of messages.

Proof: Given an $n$-vertex $d$-regular graph
$G = (V, E)$, let $A(G)$ denote the $n$ by $n$ adjacency
matrix of $G$. Clearly $d$ is the largest
eigenvalue of $A(G)$. Let $\lambda(G)$ denote the
maximum absolute value of any other eigenvalue
of $A(G)$. Lubotzky et al. [11] gave an explicit
construction of $d$-regular graphs with
$\lambda(G) \leq 2\sqrt{d-1}$, for any $d = p + 1$, $p$
prime. We prove that a.e.-agreement in the
presence of up to $\alpha n$ faults can be achieved
on an $n$-vertex $d$-regular network $G$ with
$\lambda(G) \leq 2\sqrt{d-1}$.

For each set $T$ of faulty processors we define
the set $P(T)$ as the outcome of the following
procedure:

FUNCTION P

Input: a graph $G = (V, E)$, a set $T \subset V$.

1. $Z \leftarrow \emptyset$;

2. ADD = $\{v \mid v \notin T \cup Z$, $v$ has at least
$d/5$ neighbors in $T \cup Z\}$;

3. WHILE ADD $\neq \emptyset$ DO

(a) $Z \leftarrow Z \cup$ ADD;

(b) ADD = $\{v \mid v \notin T \cup Z$, $v$ has
at least $d/5$ neighbors in $T \cup Z\}$;

4. END WHILE

5. $P(T) \leftarrow V \setminus (Z \cup T)$;

6. END FUNCTION;
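
The pruning procedure FUNCTION P can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: the graph is a dictionary of neighbor sets, the names `prune` and `circulant` are hypothetical, and the threshold is passed as an integer fraction `num/den` of the degree so that the same routine also covers a variant with a different fraction. The circulant test graph is only a convenient regular graph, not the expander the proof requires.

```python
def prune(adj, faulty, num=1, den=5):
    """Return V minus (T ∪ Z), where Z is grown as in FUNCTION P.

    adj    -- dict: vertex -> set of neighbors (assumed d-regular)
    faulty -- the set T
    A vertex joins Z once at least (num/den) * d of its neighbors
    lie in T ∪ Z.
    """
    d = max(len(nbrs) for nbrs in adj.values())
    bad = set(faulty)  # holds T ∪ Z throughout

    def candidates():
        return {
            v for v in adj
            if v not in bad
            and den * sum(1 for u in adj[v] if u in bad) >= num * d
        }

    add = candidates()
    while add:
        bad |= add
        add = candidates()
    return set(adj) - bad


def circulant(n, k):
    """2k-regular graph on n vertices: v is adjacent to v±1, ..., v±k (mod n)."""
    return {v: {(v + s) % n for s in range(-k, k + 1) if s != 0}
            for v in range(n)}


g = circulant(30, 5)            # 10-regular, so d/5 = 2
far_apart = prune(g, {0, 15})   # no vertex sees two faults; nothing is pruned
adjacent = prune(g, {0, 1})     # pruning cascades around the ring
```

The second call shows why the proof needs an expander: on a poorly expanding graph such as this ring-like circulant, two adjacent faults are enough for the pruning to spread to every vertex.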


We  first  show  that  the  set  P{T)  defined  by 
the  above  function  is  sufficiently  large. 

Lemma  1  There  exist  constants  a  >  0,  p, 
and  d,  and  an  n-vertex  d-regular  graph  G  = 
(V,E),  such  that  for  every  set  T  C  V,  \T\  < 
ajUj,  the  set  P{T)  defined  by  the  above  func¬ 
tion  is  greater  than  n  —  p\T\. 

Proof:  We  need  to  show  that  at  the  end  of 
the  execution  of  the  function  P{T),  |ruZ|  < 
pt,  for  some  constant  p. 

Alon and Chung [1] gave the following relation
between the eigenvalues of a graph and
the density of its induced subgraphs: let $e(S)$
denote the number of edges in $G$ connecting
vertices in $S$; then for any subset $S \subset V$,
$|S| = \epsilon n$,

$|e(S) - d\epsilon^2 n/2| \le \lambda(G)\,\epsilon(1 - \epsilon)n/2$.  (1)

Fix $\alpha = 1/72$, $\mu = 6$, and pick an $n$-node
$d$-regular network $G = (V, E)$ with

$\lambda(G) \le 2\sqrt{d-1}$.


Assume that there is a set $T \subset V$, such that
$t = |T| < \alpha n$, and $|T \cup Z| > \mu t$. Consider
the execution of the function $P(T)$. When a
vertex is added to $Z$ it adds $d/5$ edges to the
subgraph induced by $T \cup Z$. Since we can add
the vertices to $Z$ one at a time, if $|T \cup Z| > \mu t$,
then the graph $G$ has a subset of size $\ell = \lfloor \mu t \rfloor$
with at least $(\ell - t)d/5$ internal edges. But
since $\mu t/n < 1/12$,

$(\ell - t)d/5 \ge dt - d/5 >$
$(1/12)\,6td/2 + \sqrt{d-1}\,\mu t >$

$d(\mu t/n)^2 n/2 + (\mu t/n)\sqrt{d-1}\,(1 - (\mu t/n))\,n$

for sufficiently large (constant) $d$, which
violates (1). □
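
To get a feel for how large "sufficiently large" is, the middle inequality of the chain can be evaluated numerically. The sketch below is only an illustration: it hard-codes $\mu = 6$ and the bound $\mu t/n < 1/12$ from the proof, and the function names are ours.

```python
import math


def lemma1_gap(d: int, t: int) -> float:
    """dt - d/5 minus (1/12)*6*t*d/2 + sqrt(d-1)*6*t, the two middle terms."""
    forced = d * t - d / 5                                  # edges forced by the pruning
    allowed = (1 / 12) * 6 * t * d / 2 + math.sqrt(d - 1) * 6 * t
    return forced - allowed


def smallest_sufficient_degree(t: int) -> int:
    """Smallest d >= 2 for which the forced count exceeds the allowed count."""
    d = 2
    while lemma1_gap(d, t) <= 0:
        d += 1
    return d
```

For $t = 1$ the gap is negative at $d = 50$ and positive at $d = 200$, so the threshold is a moderate constant that does not depend on $n$.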

Lemma 1 shows that for each set $T$, $|T| < \alpha n$,
the set $P(T)$ has at least $n - \mu|T|$ vertices.
We now need to show that the family
of sets $\{P(T) \mid T \subset V,\ |T| < \alpha n\}$ satisfies the
condition of Theorem 1. We prove a stronger
result:

Lemma 2 Let $T_1, T_2$ be any two sets in $G$
such that $T_i \subset V$ and $|T_i| < \alpha n$, for $i = 1, 2$.
Then any two vertices $v_1, v_2 \in P(T_1) \cap P(T_2)$ are
connected by a path of length $O(\log n)$ in the
subgraph induced by $V \setminus (T_1 \cup T_2)$.

Proof: Consider the following variant of the
FUNCTION P in which vertices are added to
$Z$ when they have $2d/5$ neighbors in $T \cup Z$
instead of $d/5$:

FUNCTION P'

Input: a graph $G = (V, E)$, a set $T \subset V$.

1. $Z \leftarrow \emptyset$;

2. $ADD = \{v \mid v \notin T \cup Z,\ v \text{ has at least } 2d/5 \text{ neighbors in } T \cup Z\}$;

3. WHILE $ADD \neq \emptyset$ DO

(a) $Z \leftarrow Z \cup ADD$;

(b) $ADD = \{v \mid v \notin T \cup Z,\ v \text{ has at least } 2d/5 \text{ neighbors in } T \cup Z\}$;

4. END WHILE

5. $P'(T) \leftarrow V \setminus (Z \cup T)$;

6. END FUNCTION;

Let $Z_t^i$ denote the variable $Z$ after the $t$-th
iteration of computing $P(T_i)$, and let $Z_t'$ denote
the variable $Z$ after the $t$-th iteration of
computing $P'(T_1 \cup T_2)$.

We prove by induction on $t$ that
$Z_t' \subseteq Z_t^1 \cup Z_t^2$. The claim clearly holds for
$t = 0$. Assume that the claim holds for $t - 1$,
and assume that $u$ was added to $Z'$ in the
$t$-th iteration. Then $u$ has at least $2d/5$ neighbors in
$T_1 \cup T_2 \cup Z_{t-1}^1 \cup Z_{t-1}^2$, and hence at least $d/5$
neighbors in either $T_1 \cup Z_{t-1}^1$ or $T_2 \cup Z_{t-1}^2$. Thus
it is in $Z_t^1 \cup Z_t^2$.

Since $Z_t' \subseteq Z_t^1 \cup Z_t^2$,

$P'(T_1 \cup T_2) \supseteq P(T_1) \cap P(T_2)$.

Furthermore,

$P'(T_1 \cup T_2) \cap (T_1 \cup T_2) = \emptyset$.

Thus, it is enough to show that any two vertices
$v_1, v_2 \in U(T_1, T_2) = P'(T_1 \cup T_2)$ are connected
by a path of length $O(\log n)$ in the
subgraph induced by $U(T_1, T_2)$. We prove it
by showing that the graph $H(T_1, T_2)$, induced
by $U(T_1, T_2)$, is an expander.

We again use relation (1). In the original
graph $G$, no set of vertices $S$, $|S| = \theta n$,
had more than $d\theta^2 n/2 + \sqrt{d-1}\,\theta(1 - \theta)n$
internal edges (edges connecting vertices in
$S$). Consider a set $S \subset U(T_1, T_2)$,
$|S| = \theta n \le n/2$. The degree of each vertex in
$H(T_1, T_2)$ is at least $3d/5$. Thus, the number
of edges connecting vertices in $S$ to vertices
in $U(T_1, T_2) \setminus S$ is at least

$3d\theta n/5 - 2d\theta^2 n/2 - 2\sqrt{d-1}\,\theta(1 - \theta)n > \theta n d/20$,

for sufficiently large (constant) $d$. Dividing by
the maximal degree in $H(T_1, T_2)$, we conclude
that the set $S$ is connected in $H(T_1, T_2)$ to at
least $|S|/20$ vertices outside $S$, that is, $H(T_1, T_2)$
is an expander graph. □

Conclusion of the proof of Theorem 2: We
use the same algorithm as in the proof of
Theorem 1 to obtain reliable communication
between pairs of nodes in $P(T)$. However,
messages are sent only on paths of length up to
$D(G)$, where

$D(G) = \max\{\mathrm{diameter}(H(T_1, T_2)) \mid T_i \subset V,\ |T_i| < \alpha|V|,\ i = 1, 2\}$.

Since the graph $H(T_1, T_2)$ is always an
expander, $D(G) = O(\log n)$. Thus, the algorithm
requires only $O(\log n)$ communication
rounds, and it generates a polynomial number
of messages.

Assume that $T_1$ is the set of faulty processors,
$|T_1| < \alpha n$, $u, v \in P(T_1)$, and $v$ sends
messages to $u$. As in the proof of Theorem 1,
$u$ tries to extract the correct value from the
set of messages it receives from $v$ by trying
possible sets of faulty processors. When $u$
ignores all messages that traverse the set $T_1$, it
receives a consistent set of messages from $v$,
and it deduces the correct value. Our
construction guarantees that if $u$ tries any other
set $T_2$, there is a path of length no larger than
$D(G)$ that connects $u$ to $v$ and does not
traverse vertices in $T_1 \cup T_2$. Thus, when ignoring
messages that traverse the set $T_2$, $u$ receives
at least one correct message, and the faulty
processors can at most create an inconsistent
set of values, from which $u$ extracts nothing.

□ 

Corollary 1 There is a constant $\alpha > 0$ and
an $n$-vertex bounded degree network $G$, that
can be explicitly constructed, such that $G$
admits a.e.-agreement for up to $\alpha n$ faulty nodes.


Proof: Theorem 2 proves that there is a
bounded degree network $G$ and a communication
protocol $P$, such that for any set of
$t < \alpha n$ faults, $P$ guarantees reliable
communication between all pairs of processors in a set
of at least $n - \mu t$ non-faulty processors.

Let $PB$ be an agreement protocol for a
complete network with up to $\mu t$ faulty
processors. Simulating the protocol $PB$ on the
network $G$ using the communication protocol
$P$ guarantees agreement among at least
$n - \mu t$ non-faulty nodes in the presence of up
to $t < \alpha n$ faulty nodes. □

Abstract: We consider a system of $t$ synchronous
processes that communicate only by sending
messages to one another, and that together must
perform $n$ independent units of work. Processes may
fail by crashing; we want to guarantee that in
every execution of the protocol in which at least one
process survives, all $n$ units of work will be
performed. We consider three parameters: the number
of messages sent, the total number of units of work
performed (including multiplicities), and time. We
present three protocols for solving the problem.

All three are work-optimal, doing $O(n + t)$ work.
The first has moderate costs in the remaining two
parameters, sending $O(t\sqrt{t})$ messages, and taking
$O(n + t)$ time. This protocol can be easily
modified to run in any completely asynchronous system
equipped with a failure detection mechanism. The
second sends only $O(t \log t)$ messages, but its
running time is large ($O(t^2(n + t)2^{n+t})$). The third
is essentially time-optimal in the (usual) case in
which there are no failures, and its time
complexity degrades gracefully as the number of failures
increases.

1  Introduction 

A fundamental issue in distributed computing is
fault-tolerance: guaranteeing that work is
performed, despite the presence of failures. For
example, in controlling a nuclear reactor it may be
crucial for a set of valves to be closed before fuel is
added. Thus, the procedure for verifying that the
valves are closed must be highly fault-tolerant. If
processes never fail then the work of checking that
the valves are closed could be distributed according
to some load-balancing technique. Since processes
may fail, we would like an algorithm that
guarantees that the work will be performed as long as at
least one process survives.

The notion of work in this paper is very broad,
but is restricted to “idempotent” operations, that
is, operations that can be repeated without harm.
This is because if a process performs a unit of
work and fails before telling a second process of its
achievement, then the second process has no choice
but to repeat the given unit of work. Examples
include verifying a step in a formal proof,
evaluating a boolean formula at a particular assignment to
the variables, sensing the status of a valve, closing
a valve, sending a message, say, to a process
outside of the given system, printing a file, or reading
records in a distributed database.

Formally,  we  assume  that  we  have  a  synchronous 
system  of  t  processes  that  are  subject  to  crash  fail¬ 
ures,  that  want  to  perform  n  independent  units  of 
work.  (For  now,  we  assume  that  initially  there  is 
common  knowledge  among  the  t  processes  about 
the  n  units  of  work  to  be  performed.  We  return  to 
this  point  later.)  Given  that  performing  a  unit  of 
work can be repeated without harm, a trivial
solution is obtained by having each process perform
every  unit  of  work.  In  our  original  example,  this 
would  mean  that  every  process  checks  that  every 
valve  is  closed.  This  solution  requires  no  messages, 
but  in  the  worst  case  performs  tn  units  of  work  and 
runs  in  n  rounds.  (Here  the  worst  case  is  when  no 
process  fails.) 

Another straightforward solution can be
obtained by having only one process performing the
work at any time, and checkpointing to each
process after completing every unit of work. In this
solution, at most $n + t - 1$ units of work are ever
performed, but the number of messages sent is
almost $tn$ in the worst case.

In both these solutions the total amount of
effort, defined as work plus messages, is $O(tn)$. If
the actual cost of performing a unit of work is
comparable to the cost of sending a message, then
neither solution is appealing. In this abstract we focus
on solutions which are work-optimal, up to a
constant factor, while keeping the total effort
reasonable. Clearly, since a process can fail immediately
after performing a unit of work, before reporting
that unit to any other process, a work-optimal
solution performs $n + t - 1$ units of work in the worst
case. Thus, we are interested in solutions that
perform $O(n + t)$ work.
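
To make the comparison concrete, the worst-case costs of the two straightforward solutions can be tabulated. This is a rough sketch: the function names are ours, and the message count for the checkpointing solution uses the text's "almost $tn$" figure as an upper bound rather than an exact count.

```python
def replicate_everything(n: int, t: int) -> dict:
    """Every process performs every unit; the worst case is the failure-free run."""
    return {"work": t * n, "messages": 0, "rounds": n}


def checkpoint_every_unit(n: int, t: int) -> dict:
    """One active process at a time, reporting each completed unit to the others."""
    return {"work": n + t - 1, "messages": t * n}  # messages: upper bound only


def effort(costs: dict) -> int:
    """Effort is defined as work plus messages."""
    return costs["work"] + costs["messages"]
```

With $n = 100$ and $t = 10$, both solutions have effort on the order of $tn = 1000$, whereas the work-optimal target $n + t - 1$ is only 109.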

Let $n' = \max(n, t)$. Our first result is an
algorithm whose total effort is at most $3n' + 9t\sqrt{t}$.
In fact, in the worst case the amount of work
performed is at most $3n'$ and the number of messages
is at most $9t\sqrt{t}$, so the form of the bound explains
the costs exactly. We then optimize this algorithm
to achieve running time of $O(n + t)$ rounds. Note
that any solution requires $n$ rounds in the worst
case, since if $t - 1$ processes are initially faulty then
the remaining process must perform all $n$ units of
work. In this algorithm the synchrony is used only
to detect failures, as usual by detecting the absence
of an expected message. Thus, it can be easily
modified to work in a completely asynchronous system
equipped with a failure detection mechanism.

We then prove that the above algorithm is
not message-optimal (among work-optimal
algorithms), by constructing a technically
challenging work-optimal algorithm that requires only
$O(t \log t)$ messages in the worst case. Since $\Omega(n + t)$
is a lower bound on work, and hence on effort, the
$O(n + t \log t)$ effort of this algorithm is nearly
optimal. The improved message complexity is
obtained by a more subtle use of synchrony. In
particular, the absence of a message in this algorithm has
two possible meanings: either the potential sender
failed or it has insufficient “information” (generally
about the history of the execution), and therefore
has chosen not to send a message. Due to this use
of synchrony, unlike the first algorithm, this
low-effort algorithm will not run in the asynchronous
model with failure detection. In addition, the
efficiency comes at a price in time: the algorithm
requires $O(t^2(n + t)2^{n+t})$ rounds in the worst case.

The first two algorithms are very sequential:
at all times work is performed by a single
active process who uses some checkpointing
strategy to inform other processes about the completed
work. This forces the algorithms to take at least
$n$ steps, even in a failure-free run. To reduce the
time we need to increase parallelism. However,
intuitively, increasing parallelism while
simultaneously minimizing time and remaining work-optimal
may increase communication costs, since processes
must quickly tell each other about completed work.
The third algorithm does exactly this in a fairly
straightforward way, paying a price in messages in
order to decrease best-case time. It is designed to
perform time-optimally in the absence of failures,
and to have its time complexity degrade gracefully
with additional faults. In particular, it takes
$n/t + 2$ rounds in the failure-free case, where its
message cost is $O(t)$; its worst-case message cost is
$O(ft^2)$, where $f$ is the actual number of failures in
the execution. We postpone further discussion of
this algorithm to the full paper.

One application of our algorithms is to Byzantine
agreement. The idea is that the general tries to
inform $t$ processes, and then each of these $t$ processes
performs the “work” of ensuring that all processes
are informed. In particular, our $O(t \log t)$-message
solution yields an agreement algorithm for the crash
fault model that requires fewer messages than any
other algorithm in the literature. The best previous
result is a nonconstructive algorithm due to Bracha
that  requires  0{n  -f  messages,  where  n  is  the 
total  number  of  processes  in  the  system,  and  t  is  a 
bound  on  the  number  of  failures  [4]. 

Using  the  observation  that  our  solutions  to  the 
work  problem  yield  solutions  to  Byzantine  agree¬ 
ment,  we  can  now  return  to  the  assumption  that 
initially  there  is  common  knowledge  about  the  work 
to be performed. Specifically, if even one process
knows about this work, then it can act as a
general, run Byzantine agreement on the pool of work
using one of the three algorithms, and then the
actual work is performed by running the same
algorithm a second time on the real work. If $n$, the
amount of actual work, is $\Omega(t)$, then the overall
cost at most doubles when the work is not initially
common knowledge.

The idea of doing work in the presence of failures,
in a different context, has appeared elsewhere. In
a seminal paper ([5]) Kanellakis and Shvartsman
consider the Write-All problem, in which a set of
$n$ processes cooperates to set all $n$ entries of an
$n$-element array to the value 1. They provide an
efficient solution that tolerates up to $n - 1$ faults,
and show how to use it to derive robust versions of
parallel algorithms for a large class of interesting
problems. The paper was followed by a number of
papers that consider the problem in other shared
memory models (see [3, 6, 7, 8, 9]).

The Write-All problem is, of course, a special
case of the type of work we consider.
Nevertheless, our framework differs from that of [5] in two
important respects, so that their results do not
apply to our problem (nor ours to theirs). First,
they consider the shared memory model while we
consider the message passing model. Using the
shared memory model simplifies things
considerably for our problem. In the shared memory
model, there is a straightforward algorithm (that
uses shared memory to record what work has been
done) with optimal effort $O(n + t)$, running in time
$O(nt + t^2)$. While there are well-known emulators
that can translate algorithms from the shared
memory model to the message passing model (see [1, 2]),
these emulators are not applicable for our
problem, because the number of failures they tolerate
is less than a majority of the total number of
processes, while our problem allows up to $t - 1$ failures.
Also, these transformations introduce a
multiplicative overhead of message complexity that is
polynomial in $t$, while one of our goals here is to minimize
this term.¹ Second, our complexity measure is
inherently different from that of [5]. Kanellakis and
Shvartsman’s complexity measure is the sum, over
the rounds during which the algorithm is running,
of the number of processes that are not faulty
during each round. This measure essentially “charges”
for a nonfaulty process at round $r$ whether it is
actually doing any work (say, reading or writing a
cell in shared memory), or not. Our approach is
generally not to charge a process in round $r$ if it
is not expending any effort (sending a message or
performing a unit of work) at that round, since it
is free at that round to be working on some other
task.²


¹ In fact, these emulators are designed for asynchronous
systems, and hence it may be possible to improve their
resilience for our synchronous model. Nevertheless, a
multiplicative overhead in message complexity that is at least
linear in $t$ seems to be inherent in them.

² Inactive processes in our algorithms may need to both
receive messages and count the number of rounds that have
passed, say from the time they received their last message.
We assume that processes can do this while carrying on other
tasks.


2  A Protocol with Effort $O(n + t\sqrt{t})$

Our goal in this section is to present a protocol
with effort $O(n + t\sqrt{t})$ and running time $O(n + t)$.
We begin with a protocol that is somewhat simpler
to present and analyze, with effort $O(n + t\sqrt{t})$ and
running time $O(nt + t^2)$. This protocol has the
additional property of working with minimal change
in an asynchronous environment with failure
detection.

The main idea of the protocol is to use
checkpointing in order to avoid redoing too much work if
a process fails. The most naive approach to
checkpointing does not work. To understand why,
suppose a process does a checkpoint after each $n/k$
units of work. This means that up to $n/k$ units
of work are lost when a process fails. Since up to
$t$ processes may fail, this means that $nt/k$ units
of work can be lost (and thus must be repeated),
which suggests we should take $k \ge t$ if we want
to do no more than $O(n)$ units of work altogether.
However, since each checkpoint involves $t$ messages,
this means that roughly $tk$ messages will be sent.
Thus, we must have $k \le \sqrt{t}$ if we are to use fewer
than $t\sqrt{t}$ messages. Roughly speaking, this
argument shows that doing checkpoints too infrequently
means that there might be a great deal of wasted
work, while doing them too often means that there
will be a great deal of message overhead. Our
protocol avoids these problems by doing full checkpoints
to all the processes relatively infrequently (after
$n/\sqrt{t}$ units of work) but doing partial checkpoints
to only $\sqrt{t}$ processes after every $n/t$ units of work.
This turns out to be just the right compromise.
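
The tension in the naive scheme can be checked mechanically. The sketch below is illustrative only (the function names are ours): it encodes the two rough estimates from the argument above, lost work $nt/k$ and message count $tk$, and searches for a checkpoint count $k$ that meets both targets.

```python
from typing import List, Tuple


def naive_checkpoint_costs(n: int, t: int, k: int) -> Tuple[float, int]:
    """Rough worst-case (lost work, messages) when checkpointing every n/k units."""
    lost_work = n * t / k   # up to n/k units lost per failure, up to t failures
    messages = t * k        # k checkpoints, each sent to t processes
    return lost_work, messages


def feasible_k(n: int, t: int) -> List[int]:
    """Values of k with lost work <= n (k >= t) and messages <= t*sqrt(t) (k*k <= t)."""
    return [k for k in range(1, n + 1) if k >= t and k * k <= t]
```

For any $t \ge 2$ the list is empty, since $k \ge t$ and $k \le \sqrt{t}$ cannot hold together; this is the gap that the two-level checkpointing scheme is meant to close.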

2.1  Description  of  the  Algorithm 

For ease of exposition, we assume that $t$ is a
perfect square, and that $n$ is divisible by $t$ (so that,
in particular, $n \ge t$). We leave to the reader the
easy modifications of the protocol when these
assumptions do not hold. We assume that the
processes are numbered 0 through $t - 1$, and that the
units of work are numbered 1 through $n$. We
divide the processes into $\sqrt{t}$ groups of size $\sqrt{t}$ each,
and use the notation $g_i$ to denote process $i$'s group.
(Note $g_i = \lceil (i + 1)/\sqrt{t} \rceil$.) We divide the work into
$\sqrt{t}$ chunks, each of size $n/\sqrt{t}$, and subdivide the
chunks into $\sqrt{t}$ subchunks of size $n/t$.
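
The layout just described is easy to compute. The following sketch assumes, as the text does, that $t$ is a perfect square and that $t$ divides $n$; the function names `layout` and `group_of` are ours.

```python
import math


def layout(n: int, t: int) -> dict:
    """Sizes of groups, chunks and subchunks for n units of work and t processes."""
    root = math.isqrt(t)
    if root * root != t or n % t != 0:
        raise ValueError("expects t to be a perfect square and n divisible by t")
    return {
        "groups": root,            # sqrt(t) groups ...
        "group_size": root,        # ... of sqrt(t) processes each
        "chunk_size": n // root,   # sqrt(t) chunks of n/sqrt(t) units
        "subchunk_size": n // t,   # each chunk holds sqrt(t) subchunks of n/t units
        "subchunks": t,
    }


def group_of(i: int, t: int) -> int:
    """Group number g_i = ceil((i + 1) / sqrt(t)) of process i (processes start at 0)."""
    root = math.isqrt(t)
    return -(-(i + 1) // root)     # integer ceiling division
```

For $t = 16$, processes 0 to 3 fall in group 1 and processes 12 to 15 in group 4; with $n = 64$ each chunk has 16 units and each subchunk 4.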

The protocol guarantees that at each round, at
most one process is active. The active process is the
only process performing work. If process $i$ is active,
then it knows that processes 0 to $i - 1$ have crashed
or terminated. Initially, process 0 is active. The
algorithm for process 0 is straightforward: process
0 starts out doing the work, a subchunk at a time.
After completing a subchunk $c$, it does a
checkpoint to the remaining processes in its group $g_0$
(processes 1 to $\sqrt{t} - 1$); that is, it informs its group
that the subchunk of work has been completed by
broadcasting to the processes in its group a
message of the form $(c, g_0)$. (If process 0 crashes in the
middle of a broadcast, we assume only that some
subset of the processes receive the message.) We
call this a partial checkpoint, since the
checkpointing is only to the processes in $g_0$.

After completing a whole chunk of work (that is, after completing a
subchunk $c$ which is a multiple of $\sqrt{t}$), process 0
informs all the processes that chunk $c$ has been
completed, but it informs them one group at a time.
After informing a whole group, it checkpoints the
fact that a group has been informed to its own
group (i.e., group 1). Formally, after completing
a chunk $c$ that is a multiple of $\sqrt{t}$, process 0 does
a partial checkpoint to its own group, and then for
each group $g = 2, \ldots, \sqrt{t}$, process 0 broadcasts to the
processes in group $g$ a message of the form $(c, g)$,
and then broadcasts to all the processes in its own
group a message of the form $(c, g)$. We call this
a full checkpoint. Note that in a full checkpoint,
there is really a double checkpointing process; we
checkpoint both the fact that work has been
completed, and (to the processes in $g_0$) the fact that
all processes have been informed that the work has
been completed. Process 0 terminates after sending
the message $(t, \sqrt{t})$ to process $t - 1$, indicating to the
last process that the last chunk of work has been
completed (unless it crashes before that round).
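
As an illustration of the description above, the messages process 0 sends in a failure-free run can be enumerated. This is a sketch under our own reading of the text: group $g$ is taken to consist of processes $(g-1)\sqrt{t}$ through $g\sqrt{t} - 1$ (which agrees with the group formula), each broadcast is modeled as one message per recipient, and process 0 stops as soon as $(t, \sqrt{t})$ has reached process $t - 1$. The name `process0_messages` is ours.

```python
import math
from typing import List, Tuple

Message = Tuple[int, Tuple[int, int]]   # (recipient, (subchunk, group))


def process0_messages(t: int) -> List[Message]:
    """Messages sent by process 0 when nobody fails (t a perfect square)."""
    root = math.isqrt(t)
    if root * root != t:
        raise ValueError("expects t to be a perfect square")
    own_group = range(1, root)                 # the other members of group 1
    sent: List[Message] = []
    for c in range(1, t + 1):                  # subchunks are numbered 1..t
        # Partial checkpoint: tell the own group that subchunk c is done.
        sent += [(p, (c, 1)) for p in own_group]
        if c % root == 0:                      # a whole chunk is complete
            # Full checkpoint: inform groups 2..sqrt(t), one group at a time.
            for g in range(2, root + 1):
                members = range((g - 1) * root, g * root)
                sent += [(p, (c, g)) for p in members]
                if (c, g) == (t, root):        # last message reached process t-1
                    return sent
                # Record in the own group that group g has been informed.
                sent += [(p, (c, g)) for p in own_group]
    return sent
```

For $t = 16$ this yields 129 messages, within the $3t\sqrt{t} = 192$ bound that the analysis assigns to a single process.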

If  process  0  crashes,  we  want  process  1  to  become 
active; if process 1 crashes, we want process 2 to
become active, and so on. More generally, if process $j$
discovers that the first $j - 1$ processes have crashed,
then  it  becomes  active.  Once  process  j  becomes 
active,  it  continues  with  essentially  the  same  algo¬ 
rithm  as  process  0,  except  that  it  does  not  repeat 
the  work  it  knows  has  already  been  done.  We  must 
ensure  that  the  takeover  proceeds  in  a  “smooth” 
manner,  so  that  there  is  at  most  one  active  process 
at  a  time. 

Process $j$'s algorithm is as follows. If $j$ does
not know that all the work has already been
performed and sufficiently long time has passed from
the beginning of the execution, then $j$ becomes
active. “Sufficiently long” means long enough to
ensure that processes $0, \ldots, j - 1$ have crashed
or terminated. As we show below, we can take
“sufficiently long” to be defined by the function
$DD(j) = j(n + 3t)$. (“DD” stands for deadline.
We remark that this is not an optimal choice for
the deadline; we return to this issue later.) Thus, if
the round number $r$ is $< DD(j)$, then $j$ does
nothing. Otherwise, if $j$ does not know that the work
is completed, it takes over as the active process at
round $DD(j)$.
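
The takeover rule can be written down directly. In this sketch, `deadline` implements the $DD$ function, while `takes_over` and its parameter names are our own illustrative choices.

```python
def deadline(j: int, n: int, t: int) -> int:
    """DD(j) = j * (n + 3t): the round at which process j may become active."""
    return j * (n + 3 * t)


def takes_over(j: int, round_no: int, knows_work_done: bool, n: int, t: int) -> bool:
    """True exactly when process j should become active in round `round_no`."""
    if round_no < deadline(j, n, t):
        return False                       # too early: process j does nothing
    return round_no == deadline(j, n, t) and not knows_work_done
```

With $n = 64$ and $t = 16$, process 2 may take over at round $2 \cdot (64 + 48) = 224$, and process 0 is active from round 0.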

When $j$ takes over as the active process, it
essentially repeats process 0's algorithm. Suppose the
last message $j$ received was of the form $(c, g)$. Then
$j$ starts by checkpointing the fact that it is now
active to the remaining processes in its own group
$g_j$ (those processes with numbers higher than $j$,
since the remainder are known to have crashed or
terminated), by broadcasting the message $(c, g)$ to
them. Next there are two cases. If $c$ is not
a multiple of $\sqrt{t}$, then $j$ continues with the work
in subchunk $c + 1$. Process $j$ does a partial
checkpoint after completing each subchunk $d$, informing
the remaining members of its group that $d$ has been
completed by broadcasting the message $(d, g_j)$. If
$d$ marks the completion of a whole chunk of work,
then process $j$ performs a full checkpoint,
informing all processes, a group at a time, by broadcasting
the message $(d, g)$ to group $g$, and checkpointing to
the remaining members of its own group after
completing the checkpoint to group $g$ by broadcasting
to them the message $(d, g)$. If $c$ is a multiple of $\sqrt{t}$,
then $j$ continues with the full checkpoint, starting
with group $g + 1$. That is, it broadcasts to each
group $h = g + 1, \ldots, \sqrt{t}$ the message $(c, h)$, each
time checkpointing its progress in the full
checkpoint by broadcasting $(c, h)$ to the remaining
processes in $g_j$. Thus, if a process $i$ receives a message
of the form $(c, g_i)$, it learns that subchunk $c$ and all
lower numbered subchunks have been completed.
If it receives a message of the form $(c, g)$ for $g \neq g_i$,
then the sender of the message is in $i$'s group, a
full checkpoint is in progress, and group $g$ has been
informed that subchunk $c$ and all lower numbered
subchunks have been completed.
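
The case analysis above can be summarized as two small helpers. This is a sketch under our own naming (`interpret`, `resume_plan`, and the dictionary keys are hypothetical); it assumes $t$ is a perfect square and does not model the initial re-broadcast of $(c, g)$ to the new active process's own group.

```python
import math
from typing import Dict, List, Optional


def interpret(c: int, g: int, receiver_group: int) -> Dict[str, object]:
    """What a process in `receiver_group` learns from a message (c, g)."""
    relayed = g != receiver_group
    return {
        "completed_up_to": c,                     # subchunks 1..c are done
        "relayed_full_checkpoint": relayed,       # sender is in the receiver's group
        "informed_group": g if relayed else None, # group g already knows about c
    }


def resume_plan(c: int, g: int, t: int) -> Dict[str, object]:
    """What a newly active process still has to do, given its last message (c, g)."""
    root = math.isqrt(t)
    if c % root != 0:
        groups_to_inform: List[int] = []          # no full checkpoint pending
    else:
        groups_to_inform = list(range(g + 1, root + 1))
    next_subchunk: Optional[int] = c + 1 if c < t else None
    return {"groups_to_inform": groups_to_inform, "next_subchunk": next_subchunk}
```

For $t = 16$, a last message $(8, 2)$ means groups 3 and 4 must still be informed before work resumes at subchunk 9, whereas a last message $(5, 2)$ means work resumes directly at subchunk 6.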

Process $j$ terminates either upon receiving a
message of the form $(t, g)$ (since then it knows that all
the work has been completed) or after sending the
message $(t, \sqrt{t})$ to process $t - 1$ (unless it crashes
before that round). (Of course, if process $t - 1$
becomes active, it terminates after completing all the
work, since it never has to send checkpointing
messages.) This completes the description of our first
protocol. We call this Protocol A.

Notice that we can easily modify this algorithm
to run in a completely asynchronous system with
a failure detection mechanism. We assume that,
if someone fails, then the failure detection
mechanism will eventually inform all the processes that
have not failed of this fact. The modification is
trivial: rather than waiting until round $DD(i)$
before becoming active, process $i$ waits until it has
been informed that processes $0, \ldots, i - 1$ crashed or
terminated.

2.2  Analysis  and  Proof  of  Correctness 

We  now  give  a  fairly  complete  correctness  proof  for 
this  protocol,  to  give  the  reader  an  idea  of  the  type 
of  arguments  that  need  to  be  made.  (Full  proofs  of 
correctness  of  our  protocols  are  omitted  for  lack  of 
space.)  We  say  a  process  is  retired  if  it  has  either 
crashed  or  terminated. 

Lemma 2.1 A process performs at most $n$ units of
work, sends at most $3t\sqrt{t}$ messages, and runs for
less than $n + 3t$ rounds from the time it becomes
active to the time it retires. □

The  following  lemma  is  now  immediate  from  the 
definition  of  DD. 

Lemma 2.2 Assume process $i$ becomes active at
round $r$ of an execution of protocol A. Then all
processes $< i$ have retired before round $r$. □

In the sequel, it will sometimes be convenient to
view a group as a whole. Therefore we say that
a group is active in the period starting when some
process in this group becomes active and ending
when the last process of this group retires. Notice
that Lemma 2.2 ensures that when $g_i$ becomes
active, all processes in smaller groups have retired.

Theorem 2.1 In every execution of protocol A,

(a) no more than $3n$ units of work are performed
in total by the processes;

(b) no more than $9t\sqrt{t}$ messages are sent; and

(c) by round $nt + 3t^2$, all processes have retired.

Proof:  Part  (c)  is  immediate  from  Lemma  2.2 
and  the  definition  of  DD. 
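
One way to spell out the arithmetic behind part (c), assuming the deadline $DD(j) = j(n + 3t)$ and the bound of fewer than $n + 3t$ active rounds per process from Lemma 2.1:

```latex
\begin{align*}
DD(i+1) - DD(i) &= (i+1)(n+3t) - i(n+3t) = n + 3t, \\
\text{so a process activated at round } DD(i) \text{ retires before round } DD(i) + (n+3t) &= DD(i+1), \\
\text{and process } t-1 \text{ retires before round } DD(t) &= t(n+3t) = nt + 3t^2 .
\end{align*}
```

A process that never becomes active has already retired by its own deadline, so the same bound covers it.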

We prove parts (a) and (b) simultaneously. To
do so, we need a careful way of counting the total
number of messages sent and the total amount of
work done. A given unit of work may be performed
a number of times. If it is performed more than
once, say by processes $i_1, \ldots, i_k$, we say that $i_2$
redoes that unit of work of $i_1$, $i_3$ redoes the work of
$i_2$, etc. It is important to note that $i_3$ does not redo
the work of $i_1$ in this case; only that of $i_2$. Similarly,
we can talk about a message sent during a partial
checkpoint of a subchunk or a full checkpoint of a
chunk done by $i_1$ as being resent by $i_2$.

Since the completion of a chunk is followed by a
full checkpoint, it is not hard to show that when
a new group becomes active, it will redo at most
one chunk of work that was already done by
previous active groups. It will also redo at most one
full checkpoint that was done already on the
previous chunk, and $\sqrt{t}$ partial checkpoints (one for each
subchunk of work redone). Finally, if $g_j < g_i$, and
the last message sent by process $j$ before crashing
is a broadcast to process $i$'s group that was not
received by $i$, process $i$ must resend this broadcast.
In all, it is easy to see that at most $n/\sqrt{t}$ units of
work done by previous groups are redone when a
new group becomes active, and $3t$ messages are
resent. Similarly, since the completion of a subchunk
is followed by a partial checkpoint, it is not hard to
show that when a new process, say $i$, in a group
that is already active becomes active, and the last
message it received was of the form $(c, g_i)$ (i.e., a
partial checkpoint of subchunk $c$), it will redo at
most one subchunk that was already done by a
previous active process (namely, $c + 1$), and may possibly
resend the messages in two partial checkpoints: the
one sent after subchunk $c$, and the one sent after
subchunk $c + 1$ (if the previous process crashed
during the checkpointing of $c + 1$ without $i$ receiving
the message). If the last message that $i$ received
was $(c, g)$ for $g > g_i$ (that is, the checkpointing of
a checkpoint in the middle of a full checkpoint),
then similar arguments show that it may resend
$3\sqrt{t}$ messages: the checkpoint of $(c, g)$ to its own
group, the checkpoint $(c, g + 1)$ to group $g + 1$, and
the checkpointing of $(c, g + 1)$ to its own group.
Thus, the amount of work done by an active group
that is redone when a new process in that group
becomes active is at most $n/t$, and the number of
messages resent is at most $3\sqrt{t}$.

The maximum amount of unnecessary work done is: (number of
groups) × (amount of work redone when a new group becomes active)
+ (number of processes) × (amount of work redone when a new
process in an already active group becomes active)
$\le \sqrt{t}(n/\sqrt{t}) + t(n/t) = 2n$. Similarly, the maximum
number of unnecessary messages that may be sent is no more than:
(number of groups) × (number of messages resent when a new group
becomes active) + (number of processes) × (number of messages
resent when a new process in an already active group becomes
active) $\le \sqrt{t}(3t) + t(3\sqrt{t}) = 6t\sqrt{t}$. Clearly
$n$ units of work must be done; by Lemma 2.1, at most
$3t\sqrt{t}$ messages must be sent. Thus, no more than $3n$ units
of work will be done altogether, and no more than $9t\sqrt{t}$
messages will be sent altogether. ∎
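
The counting argument above is plain arithmetic and can be checked mechanically. The following is a minimal sketch, not part of the protocol: the function name `protocol_a_bounds` is hypothetical, $t$ is assumed to be a perfect square so that the group size $\sqrt{t}$ is an integer, and the figure of $3t\sqrt{t}$ necessary messages is taken from Lemma 2.1 as read off the totals in the proof.

```python
import math
from fractions import Fraction


def protocol_a_bounds(n: int, t: int) -> dict:
    """Recompute the totals of Theorem 2.1 from the per-event bounds.

    Assumption of this sketch: t is a perfect square, so there are
    sqrt(t) groups of sqrt(t) processes each.
    """
    root = math.isqrt(t)
    if root * root != t:
        raise ValueError("this sketch assumes t is a perfect square")

    # Work redone: n/sqrt(t) per newly active group, n/t per newly
    # active process inside an already active group.
    redone_work = root * Fraction(n, root) + t * Fraction(n, t)

    # Messages resent: 3t per newly active group, 3*sqrt(t) per newly
    # active process inside an already active group.
    resent_msgs = root * (3 * t) + t * (3 * root)

    necessary_work = n                # every unit must be done once
    necessary_msgs = 3 * t * root     # bound attributed to Lemma 2.1

    return {
        "redone_work": int(redone_work),             # equals 2n
        "resent_messages": resent_msgs,              # equals 6 t sqrt(t)
        "total_work": necessary_work + int(redone_work),
        "total_messages": necessary_msgs + resent_msgs,
    }
```

For example, with $n = 100$ and $t = 16$ the sketch gives a total of $3n = 300$ units of work and $9t\sqrt{t} = 576$ messages.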

2.3  Improving  the  Time  Complexity 

As we have observed, the round complexity of Protocol A is
$nt + 3t^2$. We now discuss how the protocol can be modified to
give a protocol that has round complexity $O(n + t)$, while not
significantly changing the amount of work done or the number of
messages sent.

Certainly one obvious hope for improvement is to use a better
function than DD for computing when process $i$ should become
active. While some improvement is possible by doing this, we can
get a round complexity of no better than $O(n\sqrt{t})$ if this
is all we do, which is still more than we want. Intuitively, the
problem is that if process $j$ gets a message of the form
$(c, g)$, where $c$ is a multiple of $\sqrt{t}$, then it is
possible, as far as $j$ is concerned, that some other process
$i < j$ may have received a message of the form
$(c + \sqrt{t}, h)$. Process $j$ cannot become active before it
is sure that $i$ has retired. To compute how long it must wait
before becoming active, it thus needs to compute how long $i$
would wait before becoming active, given that $i$ got a message
of the form $(c + \sqrt{t}, h)$. On the other hand, if $i$ did
get such a message, then as far as $i$ is concerned, some process
$i' < i$ may have received a message of the form
$(c + 2\sqrt{t}, h')$. Notice that, in this case, process $j$
knows perfectly well that no process received a message of the
form $(c + 2\sqrt{t}, h')$; the problem is that $i$ does not know
this, and must take into account this possibility when it
computes how long to wait before becoming active. Carrying out a
computation based on these arguments gives an algorithm which
runs in $O(n\sqrt{t})$ rounds.

On closer inspection, it turns out that the situation described
above really causes difficulties only when all processes involved
(in the example above, this would be the processes $j$, $i$, and
$i'$) are in the same group. Thus, in our modified algorithm,
process $j$ computes the time to become active as follows:
Suppose that the last message received by process $j$ before
round $r$ is $(c, g)$, and this message was received from process
$i$ at round $r'$. Process $j$ then computes a function
$F(j, c, g, i)$ with the property that if
$r = r' + F(j, c, g, i)$, then process $j$ knows at round $r$
that all processes in groups $g' < g_j$ must have retired. (If
$g_i = g_j$ then $F(j, c, g, i) = 0$.) Process $j$ then polls all
the lower-numbered processes in its own group, one by one, to see
if they are alive; if not, then $j$ becomes active. If any of
them is alive, then the lowest-numbered one that is alive becomes
active upon receiving $j$'s message. Once a process becomes
active, it proceeds just as in Protocol A. This technique turns
out to save a great deal of time, while costing relatively little
in the way of messages. We leave details of the computation of
$F(j, c, g, i)$ and the correctness of this protocol to the full
paper.
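
The polling step described above can be illustrated with a small sketch. It covers only the hand-over inside one group once the wait $F(j, c, g, i)$ has elapsed; the function name `next_active_in_group` and the representation of live processes as a set are assumptions of this sketch, and the computation of $F$ itself is not modelled.

```python
def next_active_in_group(j, group, alive):
    """Decide who becomes active after j polls its own group.

    j      -- the process whose wait has expired
    group  -- iterable with the process numbers of j's group
    alive  -- set of processes that would answer a poll
    Returns (new_active_process, number_of_polls_sent).
    """
    # j polls every lower-numbered member of its own group.
    lower = sorted(p for p in group if p < j)
    responders = [p for p in lower if p in alive]
    if responders:
        # The lowest-numbered live process takes over.
        return responders[0], len(lower)
    # Nobody below j answered, so j itself becomes active.
    return j, len(lower)
```

In a group `[4, 5, 6, 7]` where only processes 6 and 7 are alive, process 6 sends two polls and then becomes active itself.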

3 An Algorithm with Effort $O(n + t \log t)$

In this section we prove that the effort of $O(n + t\sqrt{t})$
obtained by the previous protocols is not optimal, even for
work-optimal protocols. We construct another work-optimal
algorithm, Protocol C, that requires only $O(n + t \log t)$
messages (and a variant that requires only $O(t \log t)$
messages), yielding a total effort of $O(n + t \log t)$. As is
the case with Protocols A and B, at most one process is active at
any given time. However, in Protocol C it is not the case that
there is a predetermined order in which the processes become
active. Rather, when an active process fails, we want the process
that is currently most knowledgeable to become the new active
process. As we shall see, which process is most knowledgeable
after an active process $i$ fails depends on how many units of
work $i$ performed before failing. As a consequence, there is no
obvious variant of Protocol C that works in the model with
asynchronous processes and a failure-detector.

Roughly speaking, Protocol C strives to “spread out” as uniformly
as possible the knowledge of work that has been performed and the
processes that have crashed. Thus, each time the active process,
say $i$, performs a new unit of work or detects a failure, $i$
tells this to the process $j$ it currently considers least
knowledgeable. Then process $j$ becomes as knowledgeable as $i$,
so after performing the next unit of work (or detecting another
failure), $i$ tells the process it now considers least
knowledgeable about this new fact.

The most naive implementation of this idea is
the following: Process 0 begins by performing unit
1  of  work  and  reporting  this  to  process  1.  It  then 
performs  unit  2  and  reports  units  1  and  2  to  process 
2,  and  so  on,  telling  process  i  mod  t  about  units 
1  through  i.  Note  that  at  all  times,  every  process 
knows  about  all  but  at  most  the  last  t  units  of  work 
to  be  performed. 
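
A short simulation makes the round-robin reporting pattern concrete. This is only a sketch of the failure-free behaviour of the naive scheme: the function name `naive_knowledge` is hypothetical, and a report addressed to process 0 (when the unit number is a multiple of $t$) is treated as a no-op because process 0 is the worker.

```python
def naive_knowledge(units_done: int, t: int) -> list[int]:
    """Return knows[j], the highest unit process j has been told about.

    Process 0 performs units 1..units_done; after unit u it reports
    units 1..u to process u mod t. No failures are modelled.
    """
    knows = [0] * t
    for u in range(1, units_done + 1):
        knows[0] = u          # the worker knows its own work
        knows[u % t] = u      # report units 1..u to process u mod t
    return knows
```

After 7 units with $t = 5$, the knowledge vector is `[7, 6, 7, 3, 4]`, so every process lags the worker by fewer than $t$ units.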

If process 0 crashes, we want the most knowledgeable alive
process (the one that knows about the most units of work that
have been done) to become active. (If no process alive knows
about any work, then we want the highest numbered alive process
to become active.) It can be shown that this can be arranged by
setting appropriate deadlines. Moreover, the deadlines are chosen
so that at most one process is active at a given time. The most
knowledgeable process then continues to perform work, always
informing the least knowledgeable process.

The problem with this naive algorithm is that it requires
$O(n + t^2)$ work and $O(n + t^2)$ messages in the worst case.
For example, suppose that process 0 performs the first $t - 1$
units of work, so that the last process to be informed is process
$t - 1$, and then crashes. In addition, processes
$t/2 + 1, \ldots, t - 1$ crash. Eventually process $t/2$, the
most knowledgeable non-retired process, will become active.
However, process $t/2$ has no way of knowing whether process 0
crashed just after informing it about work unit $t/2$, or process
0 continued to work, informing later processes (who must have
crashed, for otherwise they would have become active before
process $t/2$). Thus, process $t/2$ repeats work units
$t/2 + 1, \ldots, t - 1$, again informing (retired) processes
$t/2 + 1, \ldots, t - 1$. Suppose process $t/2$ crashes after
performing work unit $t - 1$ and informing process $t - 1$. Then
process $t/2 - 1$ becomes active, and again repeats this work. If
each process $t/2 - 1, t/2 - 2, \ldots, 1$ crashes after
repeating work units $t/2 + 1, \ldots, t - 1$, then $O(t^2)$ work
is done, and $O(t^2)$ messages are sent. (A slight variant of
this example gives a scenario in which $O(n + t^2)$ work is done,
and $O(n + t^2)$ messages are sent.)
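
The quadratic blow-up in this scenario is easy to tabulate. The sketch below counts only the work units named explicitly in the example (the first pass by process 0, then units $t/2 + 1, \ldots, t - 1$ repeated by each of the processes $t/2, t/2 - 1, \ldots, 1$); the function name `naive_worst_case_work` is hypothetical, and any additional units a lower-numbered process might also redo are ignored, so the result is a lower bound for that execution.

```python
def naive_worst_case_work(t: int) -> int:
    """Count work units in the crash scenario for the naive algorithm.

    Assumes t is even and at least 2, as in the example in the text.
    """
    if t < 2 or t % 2 != 0:
        raise ValueError("this sketch assumes an even t >= 2")
    first_pass = t - 1                   # process 0 performs units 1..t-1
    repeated_per_process = t // 2 - 1    # units t/2+1 .. t-1
    repeaters = t // 2                   # processes t/2, t/2-1, ..., 1
    return first_pass + repeaters * repeated_per_process
```

Doubling $t$ from 8 to 16 raises the count from 19 to 71, in line with the $t^2/4$ growth of the repeated work.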

To prevent this situation, a process performs failure detection
before proceeding with the work. The key idea here is that we
treat failure detection as another type of work. This allows us
to use our algorithm recursively for failure detection.
Specifically, fault-detection is accomplished by polling a
process and waiting for a response or a timeout. The difficulty
encountered by our approach is that, in contrast to the real
work, the set of faulty processes is dynamic, so it is not
obvious how these processes can be detected without sending
(wasteful) polling messages to nonfaulty processes. In fact, in
our algorithm we do not attempt to detect all the faulty
processes, only enough to ensure that not too much work is wasted
by reporting work to faulty processes.

3.1  Description  of  the  Algorithm 

For ease of exposition we assume $t$ is a power of 2. Again, the
processes are numbered 0 through $t - 1$, and the units of work
are numbered 1 through $n$. Although our algorithm is recursive
in nature, it can more easily be described when the recursion is
unfolded. Processing is divided into $\log t$ levels, numbered 1
to $\log t$, where level $\log t$ would have been the deepest
level of the recursion, had we presented the algorithm
recursively. In each level, the processes are partitioned into
groups as follows. In level $h$, $1 \le h \le \log t$, there are
$t/2^{\log t - h + 1}$ groups of size $2^{\log t - h + 1}$. Thus,
in level $\log t$, there are $t/2$ groups of size 2, in level
$\log t - 1$ there are $t/4$ groups of size 4, and so on, until
level 1, in which there is a single group of size $t$. Let
$s_h = 2^{\log t - h + 1}$ denote the size of a group at level
$h$. The first group of level $h$ contains processes
$0, 1, \ldots, s_h - 1$, the next group contains processes
$s_h, s_h + 1, \ldots, 2s_h - 1$, and so on. Thus each group of
level $h < \log t$ contains two groups of level $h + 1$. Note
that each process $i$ belongs to $\log t$ groups, exactly one on
each level. We let $G^i_h$ denote the level $h$ group of process
$i$.
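
The level structure can be computed directly from the process number. The following sketch assumes the group size $s_h = 2^{\log t - h + 1}$ and the contiguous numbering of groups described above; the helper names `group_size` and `group_of` are hypothetical.

```python
def group_size(h: int, t: int) -> int:
    """Size s_h of a level-h group when t is a power of 2."""
    log_t = t.bit_length() - 1
    if t < 2 or (1 << log_t) != t:
        raise ValueError("t must be a power of 2, at least 2")
    if not 1 <= h <= log_t:
        raise ValueError("level h must satisfy 1 <= h <= log t")
    return 1 << (log_t - h + 1)


def group_of(i: int, h: int, t: int) -> range:
    """Process numbers in the level-h group that contains process i."""
    s = group_size(h, t)
    start = (i // s) * s      # groups are consecutive blocks of size s
    return range(start, start + s)
```

With $t = 8$, process 5 belongs to the group $\{4, 5\}$ at level 3, to $\{4, 5, 6, 7\}$ at level 2, and to the single group of all eight processes at level 1.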

Initially process 0 is active. When process $i$ becomes active,
it performs fault-detection in its group at every level,
beginning with the highest level and working its way down,
leaving level $h$ as soon as it finds a non-faulty process in
$G^i_h$. Once fault-detection has been completed on $G^i_1$, the
set of all processes, process $i$ begins to perform real work.
Thus, we sometimes refer to the actual work as $G_0$, or level 0,
and the fault-detection on level $h$ as work on level $h$. For
each $1 \le h \le \log t$, each time it performs a unit of work
on $G^i_{h-1}$, process $i$ reports that work to some process in
$G^i_h$.

A unit of fault-detection is performed by sending a special
message “Are you alive?” to one process, and waiting for a reply
in the following round. An ordinary message informs a process at
some level $h$, $1 \le h \le \log t$, of a unit of (real or
fault-detection) work at level $h - 1$. As we shall see, an
ordinary message also carries additional information. These two
are the only types of messages sent by an active process. As
before, a process that has crashed or terminated is said to be
retired. An inactive non-retired process only sends responses to
“Are you alive?” messages.

Each process $i$ maintains a list $F_i$ of processes known by $i$
to be retired. It also maintains an array of pointers,
$\mathrm{POINT}_i$, indexed by group name. Intuitively,
$\mathrm{POINT}_i[G_0]$ is the successor of the last unit of work
known by $i$ to have been performed (and therefore this is where
$i$ will start doing work when it becomes active). For
$h \ge 1$, $\mathrm{POINT}_i[G^i_h]$ contains the successor
(according to the cyclic order in $G^i_h$) of the last process in
$G^i_h$ known by $i$ to have received an ordinary message from a
process in $G^i_h$ that was performing (real or fault-detection)
work on $G^i_{h-1}$. We call $\mathrm{POINT}_i[G^i_h]$ process
$i$'s pointer into $G^i_h$. Process $i$'s moves are governed
entirely by the round number, $F_i$, and pointers into its own
groups (i.e., pointers into groups $G^i_h$). Associated with each
pointer $\mathrm{POINT}_i[G]$ is a round number,
$\mathrm{ROUND}_i[G]$, indicating the round at which the last
message known to be sent was sent (or, in the case of $G_0$, when
the last unit of work known to be done was done). Initially,
$\mathrm{POINT}_i[G_0] = 1$, $\mathrm{POINT}_i[G^i_h]$ is the
lowest-numbered process in $G^i_h - \{i\}$, and
$\mathrm{ROUND}_i[G_0] = \mathrm{ROUND}_i[G^i_h] = 0$. We
occasionally use $\mathrm{ROUND}_i[G](r)$ to denote the value of
$\mathrm{ROUND}_i[G]$ at the beginning of round $r$; we similarly
use $F_i(r)$ and $\mathrm{POINT}_i[G](r)$.

The triple $(F_i, \mathrm{POINT}_i, \mathrm{ROUND}_i)$ is the
view of process $i$. We also define the reduced view of process
$i$ to be $\mathrm{POINT}_i[G_0] - 1 + |F_i|$; thus, $i$'s
reduced view is the sum of the number of units of work known by
$i$ to be done and the number of processes known by $i$ to be
faulty. A process includes its view whenever it sends an ordinary
message. When process $i$ receives an ordinary message, it
updates its view in light of the new information received. Note
that process $i$ may receive information about one of its own
groups from a process not in that group. Similarly, it may pass
to another process information about a group in which the other
process is a member but to which $i$ does not belong.
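
As a concrete data-structure sketch, a view can be held in a small record with the reduced view computed on demand. The class name `View`, its field names, and the use of the string key `"G0"` for the work pointer are illustrative assumptions; how a view is merged on receipt of an ordinary message is left out, since the text does not spell out the update rule here.

```python
from dataclasses import dataclass, field


@dataclass
class View:
    """The triple (F_i, POINT_i, ROUND_i) kept by one process."""
    retired: set = field(default_factory=set)    # F_i
    point: dict = field(default_factory=dict)    # POINT_i, keyed by group name
    round_: dict = field(default_factory=dict)   # ROUND_i, keyed by group name

    def reduced_view(self) -> int:
        """POINT_i[G_0] - 1 + |F_i|: units known done plus known failures."""
        return self.point["G0"] - 1 + len(self.retired)
```

A process that knows units 1 through 7 are done (so its work pointer is 8) and knows of two retired processes has reduced view 9; a process in its initial state has reduced view 0.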

Let $G^i_h$ be any group as described above, where the process
numbers range from $x$ to $y = x + |G^i_h| - 1$. There is a
natural fixed cyclic order on the group, which we call the cyclic
order. Process $i$ sends messages to members of $G^i_h$ in
increasing order. By this we mean according to the cyclic order
but skipping itself and all processes in $F_i$. Let $j \ne i$ be
in $G^i_h$. Then $j$'s $i$-successor in $G^i_h$ is $j$'s nearest
successor in the cyclic ordering that is not in
$\{i\} \cup F_i$. We omit the $i$ in “$i$-successor,” as well as
the name of the group in which the successor is to be determined,
when these are clear from the context.
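
The $i$-successor rule can be written out as a short function. This is a sketch under two assumptions that the text leaves open: the function name `i_successor` is hypothetical, and when $j$ is the only process outside $\{i\} \cup F_i$ the search is allowed to wrap all the way around and return $j$ itself (returning `None` only when no eligible process exists).

```python
def i_successor(j, group, i, retired):
    """Nearest successor of j in the cyclic order of `group`
    that is neither i itself nor in the retired set F_i."""
    members = list(group)
    pos = members.index(j)
    size = len(members)
    for step in range(1, size + 1):
        candidate = members[(pos + step) % size]
        if candidate != i and candidate not in retired:
            return candidate
    return None   # everyone other than i is known to be retired
```

In the group `range(4, 8)` with $i = 5$ and $F_i = \{6\}$, the successor of 4 is 7 (skipping 5 and 6) and the successor of 7 wraps around to 4.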

When process $i$ first becomes active it searches for other
non-retired processes as follows. For each level $h$, starting
with $\log t$ and going down to 1, process $i$ polls group
$G^i_h$, starting with $\mathrm{POINT}_i[G^i_h]$, by sending an
“Are you alive?” message. If no answer is received, it adds this
process to $F_i$. If $h < \log t$, process $i$ sends an ordinary
message reporting this newly detected failure to
$\mathrm{POINT}_i[G^i_{h+1}]$, sets
$\mathrm{POINT}_i[G^i_{h+1}]$ to its $i$-successor in
$G^i_{h+1}$, and sets $\mathrm{ROUND}_i[G^i_{h+1}]$ to the
current round number. Process $i$ repeats these steps until an
answer is received or $G^i_h \setminus \{i\} \subseteq F_i$. It
then enters level $h - 1$, and repeats the process. Level 0 is
handled similarly to levels 1 through $\log t - 1$, but the
process performs real work instead of polling, and increases the
work pointer after performing each unit of work. If
$\mathrm{POINT}_i[G_0] = n + 1$ then process $i$ halts, since in
this case all the work has been completed. This completes the
description of the behavior of an active process. The code for an
active process appears in Figure 1.

At any time in the execution of the algorithm, each inactive
non-retired process $i$ has a deadline. We define $D(i, m)$ to be
the number of rounds that process $i$ waits from the round in
which it first obtained reduced view $m$ until it becomes active:

$$D(i, m) = \begin{cases} K(n + t - m)\,2^{n+t-m} & \text{if } m \ge 1, \\ K(t - i)(n + t)\,2^{n+t-1} & \text{otherwise,} \end{cases}$$

where $K = 5t + 2 \log t$. As we show below (Lemma 3.2), $K$ is
an upper bound on the number of rounds that any process needs to
wait before first hearing from the active process. (More
formally, if $j$ becomes active at round $r$ and is still active
$K$ rounds later, then by the beginning of round $r + K$, all
processes that are not retired will have received a message from
$j$.) All our arguments below work without change if we replace
$K$ by any other bound on the number of rounds that a process
needs to wait before first hearing from the active process. This
observation will be useful later, when we consider a slight
modification of Protocol C.

If process $i$ receives no message by the end of round
$D(i, 0) - 1$, then it becomes active at the beginning of round
$D(i, 0)$. Otherwise, if at round $r$ it receives a message based
on which it obtains a reduced view of $m$, and if it receives no
further messages by the end of round $r + D(i, m) - 1$, it
becomes active at the beginning of round $r + D(i, m)$. This
completes the description of the algorithm.
h := log t;
While h > 0 do:
    DONE := FALSE;
    While ¬DONE do:
        Send “Are you alive?” to POINT_i[G^i_h];
        If no response
        then add POINT_i[G^i_h] to F_i;
            If h ≠ log t
            then send ordinary message to POINT_i[G^i_{h+1}];
                ROUND_i[G^i_{h+1}] := current round;
                POINT_i[G^i_{h+1}] := successor(POINT_i[G^i_{h+1}]);
            If G^i_h − F_i ≠ {i}
            then POINT_i[G^i_h] := successor(POINT_i[G^i_h]);
            else DONE := TRUE
        else (i.e., response received) DONE := TRUE;
    h := h − 1.

Process level 0 (real work):

While POINT_i[G_0] ≤ n do:
    Perform work unit POINT_i[G_0];
    If POINT_i[G_0] ≠ n then
        Send an ordinary message to POINT_i[G^i_1];
        ROUND_i[G^i_1] := current round;
        POINT_i[G^i_1] := successor(POINT_i[G^i_1]);
    POINT_i[G_0] := successor(POINT_i[G_0]);

Figure 1: Code for Active Process i in Protocol C
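
The inner polling loop of Figure 1 can be exercised in isolation. The sketch below simulates fault-detection on a single level only: the function name `poll_level` and the `alive` set (standing in for whether a reply arrives) are assumptions, the ordinary messages that report each detected failure to the next level are omitted, and the initial pointer is assumed to name a process that is not yet in $F_i$.

```python
def poll_level(i, group, pointer, retired, alive):
    """Fault-detection by active process i on one of its groups.

    Returns (pointer, retired, polls): the final pointer into the
    group, the updated set F_i, and the number of polls sent.
    """
    members = list(group)
    retired = set(retired)          # work on a copy of F_i
    polls = 0

    def successor(j):
        # Nearest cyclic successor of j that is not i and not in F_i.
        pos = members.index(j)
        for step in range(1, len(members) + 1):
            cand = members[(pos + step) % len(members)]
            if cand != i and cand not in retired:
                return cand
        return None

    while True:
        polls += 1                  # send "Are you alive?" to pointer
        if pointer in alive:
            break                   # response received: leave this level
        retired.add(pointer)        # no response: record the failure
        if set(members) - retired == {i}:
            break                   # nobody else is left in the group
        pointer = successor(pointer)
    return pointer, retired, polls
```

For instance, process 0 polling the group `range(4)` from pointer 1, when only processes 0 and 3 are alive, detects the failures of 1 and 2 and stops at process 3 after three polls.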


3.2  Analysis  and  Proof  of  Correctness 

Lemma 3.1 In every execution of Protocol C in which there are no
more than $t - 1$ failures, the work is completed. ∎

The  next  lemma  shows  that  our  choice  of  K  has 
the  properties  mentioned  above. 

Lemma 3.2 If $j$ is active at round $r$, and is not retired by
round $r + 5t + 2 \log t$, then all processes that are not
retired will receive a message from $j$ before the beginning of
round $r + 5t + 2 \log t$. ∎

If $i$ received its last ordinary message from $j$ at round $r$,
we call other processes that received an ordinary message from
$j$ after $i$ did first-generation processes (implicitly, with
respect to $i$, $j$, and $r$). If $i$ did not yet receive any
ordinary messages, then the first-generation processes (with
respect to $i$ and $r$) are those that received an ordinary
message from a process with a number greater than $i$. We define
$k$th generation processes inductively. If we have defined $k$th
generation, then the $(k + 1)$st generation are those processes
that receive an ordinary message from a $k$th generation process.
The rank of a process is the highest generation that it is in.

Lemma 3.3 Let $i$ receive its last ordinary message from $j$ at
round $r$, let $m$ be the reduced view of $i$ after receiving
this message, and let $\ell$ be a $k$th rank process with respect
to $i$, $j$, and $r$. Then, after $\ell$ receives its last
ordinary message, its reduced view is at least $m + k$. ∎

We say process $i$ knows more than process $j$ at round $r$ if
$F_i(r) \supseteq F_j(r)$ and for all groups $G$,
$\mathrm{ROUND}_i[G](r) \ge \mathrm{ROUND}_j[G](r)$. Note that if
equality holds everywhere then intuitively the two processes are
equally knowledgeable. We first show that our algorithm has the
property that for any two inactive non-retired processes, one of
them is more knowledgeable than the other, unless they both know
nothing; that is, the knowledge of two non-retired processes is
never incomparable. This is important so that the “most
knowledgeable” process is well-defined. Moreover, the knowledge
can be quantified by the reduced view. Process $i$ knows more
than inactive process $j$ if and only if the reduced view of $i$
is greater than the reduced view of $j$. Finally, the algorithm
also ensures that the active process is at least as knowledgeable
as any inactive non-retired process.

Lemma 3.4 For every round $r$ of the execution the following
hold:

(a) If process $i$ received an ordinary message from process $j$
at round $r' < r$, and $i$ is inactive and has not retired by the
beginning of round $r$, then at the beginning of round $r$, no
processes other than $j$ and processes in the $k$th generation
with respect to $i$, $j$, and $r'$, for some $k \ge 1$, know as
much as $i$.

(b) Suppose process $i$ received its last ordinary message at
round $r'$ (if $i$ has received no ordinary messages then
$r' = 0$), and $m$ is $i$'s reduced view after receiving this
message. If $i$ is not retired at the beginning of round
$r = r' + D(i, m)$, and it receives no further ordinary messages
before the beginning of round $r$, then at the beginning of round
$r$ no non-retired process knows more than $i$.

(c) At most one process is active in round $r$.

(d) At the beginning of round $r$, there is an asymmetric total
order (“knows more than”) on the non-zero knowledge of the
inactive non-retired processes, and the active process knows at
least as much as the most knowledgeable among these processes.
Moreover, for any two non-retired processes $i$ and $j$, $i$
knows more than $j$ if and only if the reduced view of $i$ is
greater than the reduced view of $j$. ∎

Lemma 3.5 The running time of the algorithm is at most
$tK(n + t)2^{n+t}$ rounds.

Proof: If process $i$'s reduced view is $m$ and it does not
receive a message within $D(i, m)$ steps, then it becomes active.
Each message that $i$ receives increases its reduced view. Thus,
$i$ becomes active in at most
$D(i, 0) + \cdots + D(i, n + t - 1)$ rounds. Once it becomes
active, arguments similar to those used in Lemma 3.2 show that it
retires in at most $2n + 3t + 2 \log t$ rounds. Thus, the running
time of the algorithm is at most
$D(1, 0) + \cdots + D(1, n + t - 1) + 2n + 3t + 2 \log t \le tK(n + t)2^{n+t}$
rounds. ∎

The next lemma shows that an active process $i$ does not send
messages to retired processes that, because they were more
knowledgeable than $i$, should have become active before $i$ did.
These messages are avoided because during fault detection $i$
discovers that these processes have retired.

Lemma 3.6 If process $i'$ gets an ordinary message at round $r'$
from a process operating on group $G^i_{h-1}$, and process $i$ is
active at the beginning of round $r > r'$ then:

(a) if $\mathrm{ROUND}_i[G^i_h](r) > r'$, then all processes in
the interval $[i', \mathrm{POINT}_i[G^i_h](r))$ in the cyclic
order on $G^i_h$ are either retired by the beginning of round $r$
or receive an ordinary message in the interval
$[r', \mathrm{ROUND}_i[G^i_h](r)]$ from a process operating on
$G^i_{h-1}$. (If $i' = \mathrm{POINT}_i[G^i_h](r)$, then all
processes in $G^i_h$ are either retired by the beginning of round
$r$ or receive a message in the interval
$[r', \mathrm{ROUND}_i[G^i_h](r)]$ from a process operating on
$G^i_{h-1}$.) Moreover, either $i$'s knowledge at the beginning
of round $r$ is greater than $i'$'s knowledge at the end of $r'$,
or $i' \in F_i(r)$.

(b) if $\mathrm{ROUND}_i[G^i_h](r) < r'$, then all processes in
the interval $[\mathrm{POINT}_i[G^i_h](r), i']$ in the cyclic
order on $G^i_h$ are either retired by the beginning of round
$r'$, or receive a message in the interval
$(\mathrm{ROUND}_i[G^i_h](r), r']$ from a process operating on
$G^i_{h-1}$. Moreover, all the processes in this interval are
retired by the beginning of round $r$, and if
$G^{i'}_h = G^i_h$, then all these processes will be in $F_i$ by
the time $i$ begins to operate on $G^i_{h-1}$. ∎

Observe that the algorithm treats “Are you alive?” messages as
real work. Therefore, in the sequel, we will refer to these
messages as work unless stated otherwise. On the other hand, the
ordinary messages are still referred to as messages.

Using Lemma 3.6, we can show that indeed effort is not wasted:

Lemma 3.7 The number of work units done and reported to $G^i_h$
by group $G^i_h$ when operating on group $G^i_{h-1}$ is no more
than $|G^i_h| + |G^i_{h-1}|$.

Proof: Given $i$, $h$, and an execution $e$ of Protocol C, we
consider the sequence of triples $(x, y, z)$, with one triple in
the sequence for every time a process $x \in G^i_h$ sends an
ordinary message reporting a unit of work $y \in G^i_{h-1}$ to a
process $z \in G^i_h$, listed in the order that the work was
performed. We must show that the length of this sequence is no
more than $|G^i_{h-1}| + |G^i_h|$.

We say that a triple $(x, y, z)$ is repeated in this sequence if
there is a triple $(x', y, z')$ later in the sequence where the
same work unit $y$ is performed. Clearly there are at most
$|G^i_{h-1}|$ nonrepeated triples in the sequence, so it suffices
to show that there are at most $|G^i_h|$ repeated triples. To
show this, it suffices to show that the third components of
repeated triples (denoting which process was informed about the
unit of work) are distinct. Suppose, by way of contradiction,
that there are two repeated triples $(x_1, y_1, z_1)$ and
$(x_2, y_2, z_1)$ with the same third component. Suppose that
$x_1$ informed $z_1$ about $y_1$ in round $r'$, and $x_2$
informed $z_1$ about $y_2$ in round $r''$. Without loss of
generality, we can assume that $r' < r''$. Since
$(x_1, y_1, z_1)$ is a repeated triple, there is a triple
$(x_3, y_1, z_2)$ after $(x_1, y_1, z_1)$ in the sequence. Let
$r_3$ be the round in which $x_3$ became active, and let $r_2$ be
the round in which $x_2$ became active. Let
$s_j = \mathrm{ROUND}_{x_j}[G^i_h](r_j)$, for $j = 2, 3$. By
Lemma 3.6, if $s_2 > r'$ then either $x_2$'s knowledge at the
beginning of round $r_2$ is greater than $z_1$'s knowledge at the
end of $r'$, or $z_1 \in F_{x_2}(r_2)$, and if $s_2 < r'$, then
$z_1 \in F_{x_2}$ before $x_2$ starts operating on $G^i_{h-1}$.
Since $x_2$ sends a message to $z_1$ while operating on
$G^i_{h-1}$, it cannot be the case that $z_1 \in F_{x_2}$ before
$x_2$ starts operating on $G^i_{h-1}$, so it must be the case
that $s_2 > r'$ and $x_2$'s knowledge at the beginning of round
$r_2$ is greater than $z_1$'s knowledge at the end of round $r'$.
In particular, this means that $x_2$ must know that $x_1$
informed $z_1$ about $y_1$ at the beginning of $r_2$.

We next show that every process $x \in G^i_h$ that is active at
some round $r$ between $r'$ and $r_2$ must know that $x_1$
informed $z_1$ about $y_1$ at the beginning of round $r$. For
suppose not. Then, by Lemma 3.6, $z_1$ must have retired by the
beginning of round $r$. Since, by Lemma 3.4, $x$ is the most
knowledgeable process at the beginning of round $r$, it follows
that no process that is not retired knows that $z_1$ was informed
about $y_1$. Thus, there is no way that $x_2$ could find this out
by round $r_2$.

It is easy to see that $x_3$ does not know that $z_1$ was
informed about $y_1$ (for if it did, it would not repeat the unit
of work $y_1$). Therefore, $(x_3, y_1, z_2)$ must come after
$(x_2, y_2, z_1)$ in the sequence. Since
$\mathrm{POINT}_{x_2}[G^i_h](r'') = z_1$, and $z_1$ received an
ordinary message from $x_1$ while operating on $G^i_{h-1}$ at
round $r'$, it follows from Lemma 3.6 that between rounds $r'$
and $r''$, every process in $G^i_h$ that is not retired must
receive an ordinary message. In particular, this means that $x_3$
must receive an ordinary message. Since all active processes
between round $r'$ and $r''$ know that $z_1$ was informed about
$y_1$, it follows that $x_3$ must know it too by the end of round
$r''$. But then $x_3$ would not redo $y_1$, giving us the desired
contradiction. ∎

Theorem 3.1 In every execution of Protocol C the following hold:

(a) The total amount of real work performed is no more than
$n + 2t$ units;

(b) The number of messages sent is no more than
$n + 6t \log t + 4t$;

(c) The total number of rounds is no more than
$t(5t + 2 \log t)(n + t)2^{n+t}$.

Proof: Lemma 3.7 implies that the amount of real work units that
are performed and reported to $G_1$ is no more than
$|G_0| + |G_1| = n + t$. In addition, each of the $t$ processes
may perform one unit without reporting it (because it retired
immediately afterwards). Summing the two, (a) follows.

Part (b) follows fairly easily from Lemma 3.7, while part (c) is
immediate from Lemma 3.5. ∎

We remark that we can improve the message
complexity to $O(t \log t)$ (that is, remove the $n$ term
in (b) above) by informing processes in group $G_1$
after $n/t$ units of work done at level $G_0$, rather
than after every unit of work. The total work done
is still $O(n + t)$; the time complexity increases to
$t(2n + 3t + 2\log t)(n + t)2^{n+t}$ because of an increase
in $K$ (the upper bound on the number of rounds,
from the time the currently active process takes
over, that any process needs to wait before first
hearing from the active process).

4 Application to Byzantine Agreement

Each of our algorithms can be used to construct an
algorithm for Byzantine agreement along the following
lines. The general sends its value to processes
$T = \{0, \ldots, t\}$ (note that at least one of
these processes is non-faulty) and then decides on
this value. The $t + 1$ processes then must perform
the “work” of informing first the rest of $T$ and then
the remaining $n - t - 1$ processes of the value heard
from the general. Thus, the units of work are, in order,
informing processes $1, 2, \ldots, n$. At all times a
process's current value is the last value it has heard
(0 if it has never heard anything). When a process
becomes active, it informs others of its current
value. At the end of the algorithm it decides on its
current value.

The proof of correctness of this algorithm varies
according to which of Algorithms A, B, and C is
used for performing work. In the first two cases
the proof relies on the fact that processes are informed
about work (more or less) in the same order
in which the work was performed. In the last
case the proof depends on the fact that the active
process is always the most knowledgeable one.

We remark that not every algorithm for performing
work yields an algorithm for Byzantine agreement
along the lines that we have described (consider,
for example, the trivial algorithm for performing
work, in which all processes perform all $n$
units of work).

5  Conclusions 

In this paper we have formulated the problem of
performing work efficiently in the presence of faults.
We presented three work-optimal protocols to solve
the problem. One sends $O(t\sqrt{t})$ messages and takes
$O(n + t)$ time; another requires $O(t \log t)$ messages
at the cost of significantly greater running time. In
the full paper we present an algorithm that optimizes
on time in the usual case (where there are
few failures). In particular, in the failure-free case,
it takes $n/t + 2$ rounds and requires $O(t)$ messages.
Its time performance degrades gracefully with additional
failures, and its worst-case message complexity
is $O(f t^2)$, where $f$ is the actual number of
faults in the execution.

It would be interesting to see if message complexity
and running time could be simultaneously
optimized. It would also be interesting to prove a
nontrivial lower bound on the message complexity
of work-optimal protocols.

ABSTRACT

We  consider  distributed  computations  in  an 
asynchronous  communication  model  with 
undetectable  link  failures.  The  computational 
tasks  we  consider  are  obtaining  the  value  of  a 
predetermined  function  of  the  local  inputs 
scattered  in  the  network  (e.g.,  the  sum  of  all  local 
values).  We  call  this  task  Global  Computation. 

A  trivial  protocol  for  Global  Computation 
consists  of  each  processor  sending  its  local  input 
to  all  processors  via  flooding.  Our  aim  is  to 
justify  the  use  of  this  simple  protocol,  in  the 
presence  of  faulty  links,  by  proving  matching 
lower  bounds  on  the  message  complexity  (i.e., 
total  number  of  messages  sent)  of  Global 
Computation. 

In this paper we concentrate on the case in
which the communication links are either uni-directional
or fail in a uni-directional manner.
Our main result states that for every $n$ and $m$, the
message complexity of Global Computation on
such networks is at least

$$\frac{n \cdot m}{\mathrm{PolyLog}(n)}$$

where $n$ is the number of processors and $m$ is the
number of links. Hence, in the presence of uni-directional
link failures, the simple flooding
algorithm is optimal up to a polylogarithmic
factor.

1. INTRODUCTION

We consider distributed systems consisting
of a set of processors and a set of links
connecting pairs of processors. A basic task in a
distributed system is computing a predetermined
function of local inputs scattered in the system.
For example, one may be interested in computing
the sum of all local inputs. In general, one is
interested in computing $f(x_1, \ldots, x_n)$, where $f$ is a
predetermined function and $x_i$ is the local input of
processor $i$. The result of the computation should
be known to all (or one of) the processors in the
network. In general, computing the value of $f$
may require knowledge of all the local inputs.
The question is what is the cost, specifically the
message complexity (i.e., the number of messages
sent), of obtaining this knowledge. The answer,
of course, depends on the computational model.

The natural computation models for the
above task vary by the quality/reliability of the
links connecting the various processors. We
believe that these different models reflect different
"levels of abstraction" applied to practical
networks. A high-level model may consider
communication networks as supporting
synchronous communication. A more low-level
model assumes only asynchronous message
transmissions (but allows no faults). Assuming
asynchronous communications with detectable
faults is even more low-level, and waiving the
assumption that faults are detectable goes even
further. It should be stressed that all levels of
abstraction are justified in some sense, although
none captures reality. Yet, one should remember
that these different levels of abstraction
correspond to different layers of communication
protocols operating in the network (e.g., ISO
layers) and that high levels of abstraction are
obtained at the cost of more complex or
expensive protocols.

In this paper we consider a very weak
model of communication. Specifically, we
consider an asynchronous (message passing)
network with undetectable (fail-stop) link failures
[IR,SR,GS]. The number of link failures is not
a-priori bounded, except that it is guaranteed that
these failures never disconnect the network.

A trivial protocol for computing the value
of $f$ consists of each processor sending its local
input to all other processors in the network (by
using a flooding protocol). Of course, for a
specific "degenerate" function $f$ (e.g., a constant
function) much better protocols are possible.
However, we are interested in alternative (and
possibly cheaper) protocols for more typical
functions, and specifically for functions which are
"sensitive to all their inputs" (e.g., SUM, AND,
MAX etc.). Loosely speaking, our main result is
that in an asynchronous model of uni-directional
links with undetectable faults no significant
saving in complexity is possible. A more precise
statement follows.

We show that there exists a polynomial $P$
so that for every $n$ and $m$ the following holds.
Let $\Pi$ be an arbitrary protocol for computing the
sum (or any other input-sensitive function) of the
local values residing in the processors of some
network with $n$ processors and $m$ uni-directional
links (which may fail-stop in an undetectable
manner). Then there exists an execution of $\Pi$, on
such a network, in which at least

$$\frac{n \cdot m}{P(\log n)}$$

messages are sent. We stress that our lower
bound holds also in case only faulty links behave
in a uni-directional manner. It should be stressed
that the trivial protocol mentioned above can
compute any function of the local inputs using
$n \cdot m$ messages.
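
The $n \cdot m$ count for the trivial protocol can be checked with a small simulation. The following is only a sketch under simplifying assumptions: no link fails, the graph is strongly connected, and messages are delivered in one particular (FIFO) order. The function name `flood_all` and the data layout are illustrative and do not come from the paper.

```python
from collections import deque


def flood_all(inputs, edges):
    """Flood every local input over a directed graph.

    inputs: dict mapping each processor to its local input.
    edges:  list of directed links (tail, head).
    Returns (known, messages): what each processor learned and the
    total number of messages sent.
    """
    out = {v: [] for v in inputs}
    for tail, head in edges:
        out[tail].append(head)

    known = {v: {v: inputs[v]} for v in inputs}
    messages = 0
    in_transit = deque()

    # Every processor starts by sending its own input on each outgoing link.
    for v in inputs:
        for head in out[v]:
            in_transit.append((head, v, inputs[v]))
            messages += 1

    while in_transit:
        node, origin, value = in_transit.popleft()
        if origin in known[node]:
            continue  # already seen this origin, do not forward again
        known[node][origin] = value
        for head in out[node]:
            in_transit.append((head, origin, value))
            messages += 1
    return known, messages
```

On a directed ring with 4 processors and 4 links, each of the 4 inputs crosses each of the 4 links exactly once, which gives 16 messages; every processor can then evaluate SUM locally from the values it has collected.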

Of course, we would have been happier
to prove the above result in a model in
which both faulty and non-faulty links are bi-directional.
Yet, the uni-directional model of
faults is well motivated. Even in a setting in
which it is reasonable to assume that the initial
topology of the network is "bi-directional", it is
sometimes natural to postulate that the faults
occurring in one direction of a link are
"independent" of the performance of the other
direction. Hence, the faults may be uni-directional
even if non-faulty links are bi-directional.
Furthermore, we hope that some of
the ideas presented in the proof of the lower
bound would be useful also for proving an
analogous lower bound in the bi-directional fault
model.


2.  THE  MODEL 

Our model of computation is an
asynchronous model of uni-directional
communication with undetectable link failures.
Namely, we consider an arbitrary directed graph
with processors placed at the nodes and directed
edges representing uni-directional communication
links. Links are directed from their tail to their
head. Each processor runs a predetermined local
program. An assignment of local programs to all
processors of the network is called a protocol (or
an algorithm). An execution of a protocol on the
above network is determined both by the protocol
(and the initial local inputs) and by a scheduling
of events agreeing with the standard link axioms.
(The scheduling determines which of the
receive-message events that may occur in a
processor will occur first. A different scheduling
of the receive-message events may cause the
processor to behave differently. Specifically, this
may cause different send-message events at the
processor.) Loosely speaking, the link axioms
determine that receive-message events occurring
at the head of a link must be preceded by a
"corresponding" send-message event occurring at
the tail of this link. The correspondence of
receive-message and send-message events over
each link is a one-to-one (but not necessarily
onto) function mapping receive-message events at
the head of the link to send-message events at the
tail of the link. Furthermore, this function must
be order preserving (i.e., if a receive-message
event $r_1$ precedes a receive-message event $r_2$ at
the head of the link, then the send-message event
corresponding to $r_1$ must precede the send-message
event corresponding to $r_2$).
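
The link axioms above can be phrased as a check on a single link. This is a minimal sketch under assumptions of our own: events on one link are represented by increasing timestamps, and the correspondence is given as a list `corr`, where `corr[r]` is the index of the send-message event matched to the `r`-th receive-message event. The name `satisfies_link_axioms` is hypothetical.

```python
def satisfies_link_axioms(send_times, recv_times, corr):
    """Check the link axioms for one link.

    send_times: increasing timestamps of send-message events at the tail.
    recv_times: increasing timestamps of receive-message events at the head.
    corr:       corr[r] is the index of the send event matched to receive r.
    """
    if len(corr) != len(recv_times):
        return False  # every receive event needs a corresponding send event
    for r, s in enumerate(corr):
        if not 0 <= s < len(send_times):
            return False  # matched to a send event that does not exist
        if send_times[s] >= recv_times[r]:
            return False  # a receive must be preceded by its send
    # Strictly increasing indices give both one-to-one and order preserving.
    return all(a < b for a, b in zip(corr, corr[1:]))
```

Note that the correspondence need not be onto: in `corr = [0, 2]` the second message sent was never received, which is allowed by the axioms.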

A  link  is  said  to  be  faulty  during  an 
execution  if  the  execution  terminates  and  there 
exists  a  send-message  event  at  the  tail  of  this  link 
without  a  corresponding  receive-message  event 
(at  its  head).  Links  which  do  not  fail  during  an 
execution  are  said  to  be  non-faulty  during  the 
execution.  We  stress  that  during  the  execution 
the  processors  may  not  be  able  to  detect  whether 
a  link  is  faulty  or  not.  The  statement  that  a  link  is
faulty  is  transcendental  to  the  network  (i.e.,  it  is 
made  once  the  execution  terminates  and  by  an 
outside  observer).  We  make  no  restrictions  on 
the  number  of  faulty  links  during  an  execution. 
Instead  we  require  that  during  every  execution  the 
directed  subgraph,  defined  by  the  non-faulty 
links,  remains  strongly  connected. 

Our lower bound applies also to the
restricted case in which links are fail-stop, namely
for each link the sequence of send-message
events having corresponding receive-message
events is a prefix of the sequence of send-message
events. Furthermore, it applies also to
networks with bi-directional links vulnerable to
uni-directional faults (such networks can be
represented as directed graphs with anti-parallel
edges).
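
The definitions of a faulty link and of a fail-stop link can be stated over the same kind of per-link record. The sketch below assumes a terminated execution and represents the delivered messages of one link by the list `delivered` of indices of send-message events that have a corresponding receive-message event; both function names are illustrative only.

```python
def is_faulty(num_sends, delivered):
    """A link is faulty if some send event has no corresponding receive event."""
    return len(set(delivered)) < num_sends


def is_fail_stop(delivered):
    """Fail-stop: the delivered sends form a prefix 0, 1, ..., k-1 of all sends."""
    return list(delivered) == list(range(len(delivered)))
```

For example, a link on which sends 0 and 2 were delivered but send 1 was lost is faulty without being fail-stop, whereas a link that delivered sends 0 and 1 and then dropped everything else is faulty and fail-stop.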

Our complexity measure for computational
tasks is the number of messages sent during the
"worst" execution of the "best" protocol
achieving the task. Namely, we consider all
protocols achieving the task, and for each
protocol we consider all possible local inputs and
all possible executions of the protocol on these
inputs.

The computational task we consider is
called Global Computation. Global Computation
of a function $f$ is the task terminating so that one
designated processor outputs $f(v_1, \ldots, v_n)$, where
$v_i$ is the local input of processor $i$, and $n$ is the
number of processors in the network. The task is
initiated by the same designated processor. It can
be easily shown that other versions of the task,
such as the version in which the result of the
computation has to be obtained in all processors,
are not easier (and are not much harder either).

The complexity of global computation of a
function $f$ may depend on the function $f$ itself.
However, this dependency is not the focus of the
current paper. In particular, we are not interested
in "degenerate" functions which do not depend on
all their inputs. Instead, we are interested in the
complexity of global computations of input
sensitive functions. An $n$-ary function $f$ is called
input sensitive if there exists a sequence of values
$(v_1, \ldots, v_n)$ such that for every $i$ there exists a $u_i$
such that

$$f(v_1, \ldots, v_{i-1}, v_i, v_{i+1}, \ldots, v_n) \neq f(v_1, \ldots, v_{i-1}, u_i, v_{i+1}, \ldots, v_n).$$

For example, SUM is input sensitive, and so are
AND, MAX, and many other natural functions.
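
The definition can be tested mechanically for a proposed witness. The helper below is a sketch, not part of the paper: `witnesses_input_sensitivity` is a hypothetical name, `v` plays the role of $(v_1, \ldots, v_n)$ and `u[i]` the role of $u_i$. It only verifies a given witness; it does not search for one.

```python
def witnesses_input_sensitivity(f, v, u):
    """Return True if (v, u) witnesses that f is input sensitive.

    For every position i, replacing v[i] by u[i] must change the value of f.
    """
    base = f(v)
    for i in range(len(v)):
        changed = list(v)
        changed[i] = u[i]
        if f(changed) == base:
            return False  # input i can be changed to u[i] without effect
    return True
```

For SUM and MAX the all-zero vector with $u_i = 1$ is a witness, and for AND the all-true vector with $u_i$ false is one. A constant function fails the check for every choice of `v` and `u`.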


3.  A  LOWER  BOUND  FOR  THE 
COMPLETE  GRAPH 

In this section we present a tight lower
bound on the complexity of Global Computation
for the case of a complete communication graph.
This communication graph consists of directed
edges between every two vertices (in both
directions). We show that any algorithm which
performs Global Computation on a complete graph
with $4n + 1$ vertices has an execution in which at
least $n^3$ messages are sent.


Throughout the rest of this section we
denote by $G(V,E)$ the complete graph, and
assume that $|V| = 4n + 1$, for some $n$. We
denote the predetermined processor which
initiates the execution by $I$ and partition the rest
of the $4n$ vertices into three subsets. The vertices
of the first subset are called starters and are
denoted by $\{S_i\}_{i=1}^{n}$. The vertices of the second
subset are called transmitters and are denoted by
$\{T_i\}_{i=1}^{2n}$. The vertices of the last subset are called
receivers and are denoted by $\{R_i\}_{i=1}^{n}$ (the names
given to the vertices are derived from their role in
the following proof).

From now on we are interested only in a
subset of the graph's edges, so we drop all the
other edges from the graph. One may assume that
all these edges are faulty in every execution, and
that no message sent over such an edge reaches its
destination. The edges which remain are (see
Figure 3.1) as follows: For every starter $S_i$, we
keep the edge $(I, S_i)$. We call these edges trigger
edges (the meaning of an edge's name is related
to its role in the proof). For every starter $S_i$ and
every transmitter $T_j$, we keep the edge $(S_i, T_j)$.
We call these edges routing edges. For every
transmitter $T_i$ and every receiver $R_j$, we keep the
edge $(T_i, R_j)$. We call these edges charge edges.
We also keep, for every vertex $v \in V - \{I\}$ in the
graph, the edges $(I, v)$ and $(v, I)$. These
edges are called auxiliary edges and are not
shown in Figure 3.1. Note that the set of auxiliary
edges contains the set of trigger edges.
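
The retained edge sets can be enumerated directly, which makes the counts used later in the argument easy to verify. This is an illustrative sketch; the vertex labels (`"I"`, `("S", i)`, `("T", j)`, `("R", k)`) and the function name are our own choices.

```python
def build_lower_bound_graph(n):
    """Return the retained edge sets of the graph on 4n + 1 vertices."""
    initiator = "I"
    starters = [("S", i) for i in range(1, n + 1)]
    transmitters = [("T", j) for j in range(1, 2 * n + 1)]
    receivers = [("R", k) for k in range(1, n + 1)]
    others = starters + transmitters + receivers

    trigger = {(initiator, s) for s in starters}
    routing = {(s, t) for s in starters for t in transmitters}
    charge = {(t, r) for t in transmitters for r in receivers}
    # Auxiliary edges connect the initiator with every other vertex, both ways.
    auxiliary = {(initiator, v) for v in others} | {(v, initiator) for v in others}
    return {"trigger": trigger, "routing": routing,
            "charge": charge, "auxiliary": auxiliary}
```

For $n = 4$ this yields 4 trigger edges, $2n^2 = 32$ routing edges, $2n^2 = 32$ charge edges, and $2 \cdot 4n = 32$ auxiliary edges, with the trigger edges contained in the auxiliary set.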


Figure 3.1: The graph's structure for $n = 4$.


Any algorithm which performs Global
Computation on the graph $G$ must work correctly
for any scheduling of the message delays (as
long as it agrees with the FIFO rule for each
edge). In our proof we present a scheduler which
causes sending at least $n^3$ messages during an
execution of any such algorithm.

From now on we fix an algorithm $A$
(initiated by the processor $I$) which performs
Global Computation on the graph $G$ and describe
the scheduler's strategy. At the beginning of the
execution, the scheduler delays all messages sent
over the trigger edges and enables arrival of
messages sent over all other edges (in an arbitrary
order) until a stable state of the network is
reached.

Definition 3.1: The execution reaches a stable
state when there are no more messages in
transit over the non-delayed edges (i.e., every
message sent over a non-delayed edge is received
at its other end).

When  a  stable  state  of  the  network  is 
reached  the  scheduler  "releases"  one  of  the 
delayed  edges  (i.e.,  this  edge  is  no  longer  delayed 
and  arrival  of  messages  sent  over  it  is  enabled). 
The  scheduler  waits  till  the  execution  reaches  a 
stable  state  again.  We  stress  that  an  edge  which 
had  been  released  is  never  delayed  again  during 
this  execution.  The  scheduler  continues  this 
process  (of  releasing  delayed  edges)  until  all  the 
n  trigger  edges  are  released.  Thus,  we  have  n 
such  phases,  from  the  first  release  of  a  delayed 
trigger  edge,  till  the  execution  of  algorithm  A  is 
ended  (Global  Computation  can  not  be  completed 
before  all  trigger  edges  are  released,  because  the 
initiator  I  needs  to  get  information  from  all  the 
starters  in  order  to  compute  the  desired  input- 
sensitive  function  and  the  starters  do  not  send 
messages  before  receiving  a  message  over  their 
incoming  trigger  edges). 
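
The phase structure of this scheduler (hold back the trigger edges, run to a stable state, release one edge, repeat) can be sketched as a small event loop. Everything below is illustrative: the class `PhaseScheduler`, the handler interface, and the toy forwarding algorithm are assumptions of this sketch, not the algorithm $A$ of the proof. Auxiliary edges are omitted, and the release order is fixed rather than chosen adversarially.

```python
from collections import defaultdict, deque


class PhaseScheduler:
    """Delivers messages FIFO per edge, but holds back 'delayed' edges."""

    def __init__(self, handlers, delayed):
        self.handlers = handlers            # node -> f(sender, msg) -> [(dest, msg)]
        self.delayed = set(delayed)         # edges whose messages are held back
        self.channels = defaultdict(deque)  # (tail, head) -> messages in transit
        self.sent = defaultdict(int)        # (tail, head) -> messages sent so far

    def send(self, tail, head, msg):
        self.channels[(tail, head)].append(msg)
        self.sent[(tail, head)] += 1

    def run_to_stable(self):
        """Deliver until no message is in transit on a non-delayed edge."""
        progress = True
        while progress:
            progress = False
            for edge in list(self.channels):
                if edge in self.delayed:
                    continue
                tail, head = edge
                while self.channels[edge]:
                    msg = self.channels[edge].popleft()
                    for dest, reply in self.handlers[head](tail, msg):
                        self.send(head, dest, reply)
                    progress = True

    def release(self, edge):
        self.delayed.discard(edge)          # a released edge is never delayed again


def run_toy_example(n):
    """Toy algorithm (not from the paper): every node forwards what it hears."""
    starters = [("S", i) for i in range(n)]
    transmitters = [("T", j) for j in range(2 * n)]
    receivers = [("R", k) for k in range(n)]
    handlers = {}
    for s in starters:
        handlers[s] = lambda sender, msg: [(t, msg) for t in transmitters]
    for t in transmitters:
        handlers[t] = lambda sender, msg: [(r, msg) for r in receivers]
    for r in receivers:
        handlers[r] = lambda sender, msg: []

    trigger_edges = [("I", s) for s in starters]
    sched = PhaseScheduler(handlers, delayed=trigger_edges)
    for s in starters:
        sched.send("I", s, "go")            # sent, but held back by the scheduler
    sched.run_to_stable()                   # initial stable state

    def charge_count():
        return sum(c for (tail, head), c in sched.sent.items()
                   if tail[0] == "T" and head[0] == "R")

    per_phase = []
    for edge in trigger_edges:              # in the proof, the adversary picks the order
        before = charge_count()
        sched.release(edge)
        sched.run_to_stable()
        per_phase.append(charge_count() - before)
    return per_phase
```

With this naive forwarding algorithm and $n = 3$, each of the 3 phases produces $2n \cdot n = 18$ messages over charge edges. The point of the lower bound is that no correct algorithm can push this below $n^2$ per phase against a suitably chosen release order.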

From now on we confine ourselves to
schedulers as described above and only take
advantage of our freedom to choose the order in
which the delayed edges are released. Although
this is a special case of scheduling, the message
complexity of algorithm $A$ is measured over all
possible schedules, and therefore proving a lower
bound for these schedules only suffices.

Claim 3.1: Suppose that the execution (of the
algorithm) is in a stable state and that the set of
delayed edges is not empty. Then, there exists a
delayed edge and a corresponding schedule
(which starts by the release of this edge) such that
this schedule yields sending at least $n^2$ messages
(over the charge edges) before the next stable
state is reached.

In order to prove Claim 3.1, we first prove Claim
3.2 below.

Definition 3.2: Suppose that the execution is in a
stable state and that $(I, S_i)$ is a trigger edge which
is delayed at this state. A routing edge $(S_i, T_j)$,
for some $1 \le j \le 2n$, is called good if releasing
the trigger $(I, S_i)$ and scheduling message
delivery only along the path $I \to S_i \to T_j$ causes
$T_j$ to send messages over all its $n$ outgoing
charge edges. Otherwise the routing edge $(S_i, T_j)$
is called bad.

We stress that an edge is defined good with
respect to a specific stable state. Also, in defining
the edge $(S_i, T_j)$ as either good or bad, we
consider only the behavior of $T_j$ in schedules in
which (after the current stable state) both $S_i$ and
$T_j$ receive messages only from $I$ and $S_i$,
respectively. Finally, if $(S_i, T_j)$ is bad, it means
that there exists a receiver $R_k$ so that if we
schedule events only along the path
$I \to S_i \to T_j \to R_k$, no message from $S_i$ will
reach $R_k$ (i.e., $T_j$ will not send a message to $R_k$).
In this case, the edge $(T_j, R_k)$ is called a blocking
edge for the edge $(S_i, T_j)$.

Claim 3.2: Suppose that the execution (of the
algorithm) is in a stable state and that the set of
delayed edges is not empty. Then, there exists a
delayed edge $(I, S_i)$ such that $S_i$ has at least $n$
good outgoing routing edges (out of its $2n$
outgoing routing edges).

Proof of Claim 3.2: Let us assume to the contrary
that the execution is at a stable state and for every
delayed edge $(I, S_i)$, the vertex $S_i$ has more than
$n$ outgoing routing edges which are bad. We can
assign each delayed edge one of its corresponding
bad routing edges so that these bad edges form a
matching (i.e., each $S_i$ is assigned to a different
$T_j$). Such an assignment exists because there are
at most $n$ delayed edges and every delayed edge
has more than $n$ possible corresponding bad
routing edges. Let $M$ denote the resulting
matching.
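
The existence of the matching follows from a simple greedy argument: when a starter is processed, at most $n - 1$ transmitters have been used by earlier starters, while it has more than $n$ candidates. A minimal sketch, with the hypothetical name `greedy_matching` and with `bad` mapping each delayed starter to the set of transmitters reached by its bad routing edges:

```python
def greedy_matching(bad):
    """Assign each starter a distinct transmitter among its bad routing edges.

    Assumes at most n starters, each with more than n candidate transmitters,
    so a free candidate always exists when a starter is processed.
    """
    used = set()
    matching = {}
    for starter, candidates in bad.items():
        free = next(t for t in sorted(candidates) if t not in used)
        matching[starter] = free
        used.add(free)
    return matching
```

For $n = 3$ and candidate sets `{1, 2, 3, 4}`, `{1, 2, 3, 4}`, `{1, 2, 3, 5}`, the greedy pass assigns transmitters 1, 2, and 3 to the three starters.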

With respect to the above stable state, we
describe a schedule that contradicts the
correctness of algorithm $A$. The schedule consists
of a setting of the edges to either fail-stop (at this
stage) or non-faulty. Following is a description of
this setting.

Let $(S_i, T_j)$ be in the matching $M$ and let
$(T_j, R_k)$ be a blocking edge for $(S_i, T_j)$ (a
blocking edge must exist because $(S_i, T_j)$ is bad).
Then we set the edges $(S_i, T_j)$ and $(T_j, R_k)$ to be
non-faulty and set all other outgoing edges of
$S_i$ and $T_j$ to be fail-stop. We do the same for
every routing edge of the matching $M$ (note that
the $R_k$'s chosen by this procedure do not have to
be distinct). In addition, all (incoming and
outgoing) edges of starters or transmitters not
participating in the matching are set to be non-faulty.
Finally, all (incoming and outgoing) edges
of the receivers are non-faulty.

We first show that the graph composed of
the non-faulty edges is strongly connected. We
show for each vertex in $G$ a directed circuit,
composed of non-faulty edges, containing the
initiator $I$. Each vertex which participates in one
of the paths $I \to S_i \to T_j \to R_k$ mentioned above
participates in the directed circuit
$I \to S_i \to T_j \to R_k \to I$. Any other vertex
$v \in V - \{I\}$ which does not participate in any of
these circuits participates in the directed circuit
$I \to v \to I$, which consists of its auxiliary edges.

We now show that in this setting, Global
Computation cannot be performed. Suppose we
now release all the delayed edges. Clearly,
messages sent over the delayed edge $(I, S_i)$ and
arriving at $S_i$ can only cause delivery of messages
over $(S_i, T_j)$, since this is the only non-faulty
outgoing edge of $S_i$. The transmitter $T_j$ (for which
$(S_i, T_j)$ is its only non-faulty incoming edge)
would not send a message over the blocking edge
$(T_j, R_k)$, which is its only non-faulty outgoing
edge. Therefore, no message is sent from a
vertex that participates in the matching $M$ to any
vertex outside the matching. Since this process
started from a stable state, no other vertex in the
network can send messages. Clearly, Global
Computation is not completed, because the
initiator received no information through this
process. Thus, we have reached a contradiction.
□

Proof of Claim 3.1: By Claim 3.2 we get that
there is a delayed edge $(I, S_i)$ such that $S_i$ has at
least $n$ good outgoing routing edges (out of its $2n$
outgoing routing edges). When releasing the
delayed edge $(I, S_i)$, the vertex $S_i$ (among other
things) sends messages over its $n$ good outgoing
routing edges before getting any other messages
(we postpone delivery of other messages because
otherwise we are not guaranteed that
$S_i$ would send these messages). Each transmitter
at the end of such a good edge sends messages
over all its $n$ outgoing charge edges (as
guaranteed by the definition of a good edge).
Therefore, we have at least $n \cdot n = n^2$ messages sent
over charge edges before the next stable state is
reached. □

By Claim 3.1 we get that the release of an
appropriate edge (by the scheduler), each time the
execution reaches a stable state, causes sending at
least $n^2$ messages over charge edges before
reaching a stable state again. Since we have $n$
trigger edges, during the whole execution of
algorithm $A$ at least $n \cdot n^2 = n^3$ messages are sent over
charge edges.

Theorem 1: Let $\Pi$ be an arbitrary protocol for
computing the sum (or any other input-sensitive
function) of the local values residing in the
processors of a complete network with $n$
processors and uni-directional links (which may
fail-stop in an undetectable manner). Then there
exists an execution of $\Pi$ on the network in
which $\Omega(n^3)$ messages are sent.


4.  A  LOWER  BOUND  FOR  THE  GENERAL 
CASE 


The result of the previous section can be
restated as follows: The message complexity of
Global Computation on dense graphs with $n$
vertices and $m = \Theta(n^2)$ edges is $\Omega(n \cdot m)$. Our
aim is to generalize this result to graphs which are
sparser. Namely, our aim is to prove that for
every $n$ and $m = m(n)$ the message complexity of
Global Computation is $\Omega(n \cdot m)$. Actually, we
only obtain an

$$\Omega\left(\frac{n \cdot m}{L(n)}\right)$$

bound, where $L$ is a polylogarithmic function. To this end we modify
the graph used in the previous section as follows.
Instead of having $O(n^2)$ charge edges we have
only $O(m)$ such edges, which connect the $O(n)$
transmitters with the $\frac{m}{n}$ receivers. In addition,
we need to decrease the number of routing edges.
Loosely speaking, this is done by replacing the
$O(n^2)$ routing edges by a sparse routing gadget,
which is based on the sparse routing graph
presented below.


4.1  The  sparse  routing  graph 

A sparse routing graph of size $n$ is an
acyclic directed graph which has $n$ vertices with
indegree zero called sources, and $2n$ vertices with
outdegree zero called targets. All other vertices
in the graph are called intermediate vertices. In
addition the graph should satisfy the following
routing properties:

(1) From every source of the graph there exists
a unique directed path to each target.
Furthermore, the paths from each source to
all targets form a directed tree (in which
the source vertex is the root).

(2) The graph contains at most $2n \cdot \log_2(4n)$
vertices and at most $4n \cdot \log_2(4n)$ edges.

(3) Suppose that each source randomly selects
(uniformly and independently) one of the
$2n$ possible targets. Then the probability
that one of the induced source-target paths
crosses more than $2\log_2(n)$ of the other
paths is bounded above by $\frac{1}{2}$. (In case two
paths share $i$ of their vertices we say that
they cross each other $i$ times.)

Sparse routing graphs do exist (see
Appendix). We remark that the choice of
constants in the above definition is not essential to
the lower bound proven below.

4.2  The  graph  for  which  the  lower  bound  is 
proven 

Let $k = \log_2(n)$. We now describe the
sparse gadget (based on the sparse routing graph).
We first replace every non-source in the sparse
routing graph by $D = 5\log_2^2(n) = 5k^2$ vertices.
Then we replace each directed edge of the sparse
routing graph by the set of directed edges
between all pairs of vertices corresponding to the
original edge. This completes the construction of
one layer. Note that the layer contains many
directed paths which correspond to a single
source-target path in the original sparse routing
graph. Any such path is called a routing path and
the set of these paths is called a super-path. Our
gadget is composed of $k$ layers which share their
sources (i.e., the gadget has $n$ sources attached to
$k$ distinct layers).

Using this gadget we can present the
communication graph $G(V,E)$ which we use in
our proof. Again we denote the predetermined
initiator by $I$. The $n$ sources of the gadget are the
starters, denoted by $\{S_i\}_{i=1}^{n}$. The $2k \cdot D \cdot n$ targets
of the gadget are the transmitters, denoted by
$\{T_i\}_{i=1}^{2kDn}$. We also have in $G$ the set of $\frac{m}{n}$
receivers, denoted by $\{R_i\}_{i=1}^{m/n}$. The graph also
contains the internal vertices of the sparse routing
gadget. The edges of $G$ (in addition to the
internal edges of the gadget) are as before: For
every starter $S_i$, we have the edge $(I, S_i)$. We call
these edges trigger edges. For every transmitter
$T_i$ and every receiver $R_j$, we have the edge
$(T_i, R_j)$. We call these edges charge edges.
Again we also have, for every vertex $v \in V - \{I\}$
in the graph, the edges $(I, v)$ and $(v, I)$. These
edges are called auxiliary edges (we stress that
the internal vertices of the gadget also have
auxiliary edges). Note that in the graph $G$ we
have routing paths, from starters to transmitters,
instead of the routing edges of the graph of
Section 3.

Since the sparse routing graph contains at
most $2n\log_2(4n)$ vertices and at most
$4n\log_2(4n)$ edges, we get that the graph $G$
contains $O(n\log_2^4(n))$ vertices and
$O(m\log_2^3(n)) + O(n\log_2^6(n))$ edges.
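
The gadget parameters defined in Section 4.2 can be tabulated for concrete values. This is a sketch under simplifying assumptions that the paper does not need: $n$ is a power of two (so $k = \log_2 n$ is an integer) and $n$ divides $m$. The function name and the returned keys are illustrative.

```python
def gadget_parameters(n, m):
    """Sizes of the Section 4.2 construction for n a power of two, n dividing m."""
    k = n.bit_length() - 1          # k = log2(n), exact for powers of two
    D = 5 * k * k                   # copies of each non-source vertex per layer
    transmitters = 2 * k * D * n    # 2n targets, D copies each, k layers
    receivers = m // n
    charge_edges = transmitters * receivers
    return {"k": k, "D": D, "transmitters": transmitters,
            "receivers": receivers, "charge_edges": charge_edges}
```

For $n = 16$ and $m = 64$ this gives $k = 4$, $D = 80$, 10240 transmitters, 4 receivers, and $2kD \cdot m = 40960$ charge edges, that is, $m$ times a polylogarithmic factor in $n$.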

4.3  Lower  bound  argument 

Again, we fix an algorithm $A$ which
performs Global Computation on the graph $G$ and
"play" the role of the scheduler. The scheduler's
behavior here is very similar to its behavior in
Section 3. The set of delayed edges again contains,
at the beginning of the execution of the
protocol, the $n$ trigger edges. This time we show
that each release of a delayed edge by the
scheduler, in a stable state of the network, causes
sending at least $\frac{1}{4}m$ messages over charge edges.
This yields an $\Omega(n \cdot m)$ lower bound as required.

Claim  4.1:  Suppose  that  the  execution  (of  the 
algorithm)  is  in  a  stable  state  and  that  the  set  of 
delayed  edges  is  not  empty.  Then,  there  exists  a 
delayed  edge  and  a  corresponding  schedule 
(which  starts  by  the  release  of  this  edge)  such  that 
this schedule yields sending at least ¼m
messages  (over  the  charge  edges)  before  the  next 
stable  state  is  reached. 

The  proof  of  this  claim  follows  from  Claim  4.2. 

Definition 4.1: Suppose that the execution is in a
stable state and that (I,S_i) is a trigger edge which
is delayed at this state. A routing path, which
starts at S_i and ends at T_j, for some j, is called
good, if releasing the trigger (I,S_i) and
scheduling message delivery only along the path
I → S_i → ··· → T_j causes T_j to send messages
over all its m/n outgoing charge edges. Otherwise
the routing path S_i → ··· → T_j is called bad.

Note that if the trigger edge (I,S_i) is
delayed, the state of the network is stable and the
routing path S_i → ··· → T_j is bad, then there
exists a charge edge (T_j,R_k) such that when
releasing the edge (I,S_i) (and scheduling message
delivery only along the path I → S_i → ··· → T_j)
the vertex T_j does not send any message over the
charge edge (T_j,R_k). We call this charge edge a
blocking edge of the routing path S_i → ··· → T_j.

Definition  4.2:  A  super-path  is  called  good  at  the 
current  stable  state  if  at  least  half  of  the  routing 
paths  corresponding  to  it  are  good  (at  the  current 
stable  state).  Otherwise,  the  super-path  is  called 
bad  at  the  current  stable  state. 


Note  that  all  the  routing  paths  corresponding  to  a 
specific  super  path  belong  to  the  same  layer  of  the 
routing  gadget. 


Claim 4.2: Suppose that the execution (of the
algorithm) is in a stable state and that the set of
delayed edges is not empty. Then, there exists a
delayed edge (I,S_i) such that S_i has at least ½n
good outgoing super-paths (out of its 2nk
outgoing super-paths).


Proof of Claim 4.2: Let us assume to the contrary
that the network is at a stable state and for every
delayed edge (I,S_i), there are more than (3/2)n
super-paths in every layer of the super graph
which are bad.


Consider the following process on the first
layer of the routing gadget: Each starter, S_i,
randomly selects one of its outgoing super-paths
out of the 2n possible super-paths at the first
layer. Since there are more than (3/2)n bad
outgoing super-paths for every starter in this
layer, we get that the probability that the starter
selects a bad super-path is greater than ¾. Hence,
using Markov's inequality, we get that with
probability greater than ½, at least half of the
starters select bad super-paths. By the properties
of the super-paths of the layer we get that the
probability that one of the selected super-paths
crosses more than 2log2(n) other selected
super-paths is bounded by ½. Combining the
above two facts we get that there exists a possible
choice in which at least half of the starters select
bad super-paths and none of these selected
super-paths crosses more than 2log2(n) other
selected super-paths.
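
The Markov step in the paragraph above can be spelled out as follows. This is only a sketch, under the assumption that each starter selects a bad super-path with probability greater than 3/4 (a value reconstructed from the surrounding argument); the symbol X, denoting the number of the n starters that select a good super-path, is introduced here for illustration and does not appear in the original text.

```latex
\mathbb{E}[X] \;<\; \tfrac{1}{4}\,n ,
\qquad
\Pr\!\left[\,X \ge \tfrac{1}{2}\,n\,\right]
\;\le\; \frac{\mathbb{E}[X]}{n/2}
\;<\; \tfrac{1}{2} .
```

Hence with probability greater than ½ fewer than half of the starters select good super-paths, i.e. at least half select bad ones. Since the crossing event has probability at most ½, both requirements hold simultaneously with positive probability.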

By this process we have selected bad
super-paths for half of the starters. Applying the
same process for the remaining starters on the
second layer of the gadget yields bad super-paths
for half of them. Thus, by using all k = log2(n)
layers of the gadget we can select for each starter
a bad super-path in a way that none of these
selected super-paths crosses more than 2log2(n)
other selected super-paths.


Our next step is to choose for each starter
one of the bad routing paths, corresponding to its
bad super-path, in a way that all these paths are
vertex disjoint. This process can be done
sequentially (i.e., for starter after starter).
Suppose we try to choose a routing path, for a
new starter, out of the paths corresponding to its
super-path. In order to choose a vertex disjoint
routing path for the new starter, we have to avoid
paths containing vertices which participate in
routing paths chosen in previous stages. We use
the fact that each super-path crosses at most
2log2(n) of the other super-paths. Each such

crossing  rules  out  at  most  a  —  fraction  of  the 


possible  paths  corresponding  to  the  current 

‘7  12 

super-path.  Hence,  at  most  a  21og2(n)"^  =  ‘j 

fraction  of  the  paths  are  ruled  out.  Since,  at  least 
half  of  the  routing  paths  of  the  super-path  are  bad, 
we  can  choose  a  bad  routing  path  which  does  not 
cross  any  of  the  previous  ones. 


With respect to the above stable state, we
again describe a schedule which contradicts the
correctness of algorithm A. The schedule consists
of a setting of the edges to either fail-stop (at this
stage) or non-faulty.

Let I → S_i → ··· → T_j be a path formed
by the delayed edge (I,S_i) and the bad routing
path chosen for S_i. Let (T_j,R_k) be a blocking
edge for that bad routing path. Then we set the
edges of the bad routing path, along with the edge
(T_j,R_k), to be non-faulty and set all other
outgoing edges of the vertices of the bad routing
path to be fail-stop. We do the same for every
chosen bad routing path. Again, all edges
(incoming and outgoing) between vertices not on
any of these paths are set to be non-faulty.

We first show that the subgraph consisting
only of the non-faulty edges is strongly connected.
We show again for each vertex in G a directed
circuit, composed of non-faulty edges, containing
the initiator I. Each vertex which participates in
one of the paths I → S_i → ··· → T_j → R_k, where
S_i → ··· → T_j is the bad routing path chosen for
S_i, participates in the directed circuit
I → S_i → ··· → T_j → R_k → I. Any other vertex
v ∈ V − {I} which does not participate in any of
these circuits participates in the directed circuit
I → v → I which consists of its auxiliary edges.


We now show that in this setting, Global
Computation cannot be performed. Suppose we
release now all the delayed edges. Clearly,
messages sent over the delayed edge (I,S_i) and
arriving at S_i can only cause delivery of messages
over the (chosen) path S_i → ··· → T_j, since it is
composed of the only non-faulty edges reachable
from S_i. The transmitter T_j (for which the last edge
of this path is its only non-faulty incoming edge)
would not send a message over the blocking edge
(T_j,R_k), which is its only non-faulty outgoing
edge. Therefore, no message is sent from a
vertex which participates in one of the paths
constructed above to any vertex outside these
paths. Since this process started from a stable
state, no other vertex in the network can send
messages. Clearly, Global Computation is not
completed, because the initiator received no
information through this process. Thus, we have
reached a contradiction. □

Proof of Claim 4.1: By Claim 4.2 we get that
there is a delayed edge (I,S_i) such that S_i has at
least ½n good outgoing super-paths. Consider
randomly choosing for these super-paths
corresponding routing paths in a way which
forms a directed tree. In other words, consider the
directed "super-tree" corresponding to the set of
good super-paths and select uniformly a tree
corresponding to this super-tree. Every routing
path in that tree is good with probability at least
½. Therefore, it is possible to construct such a tree
in which at least half of the paths are good. The
schedule corresponding to the delayed edge (I,S_i) is
described below. When releasing the delayed
edge (I,S_i), vertex S_i sends messages over its
½·½n = ¼n outgoing good routing paths (which
participate in the directed tree). Each transmitter
at the end of such a good routing path sends
messages over all its m/n outgoing charge edges
(as guaranteed by the definition of a good routing
path). Therefore, at least ¼m messages are sent
over charge edges during this schedule (before
reaching the next stable state). □


By Claim 4.1 we get that the release of an
appropriate edge (by the scheduler), each time the
execution reaches a stable state, causes sending at
least ¼m messages over charge edges before
reaching a stable state again. Since we have n
trigger edges, during the whole execution of
algorithm A, at least ¼n·m messages are sent
over charge edges.

Theorem 2: Let Π be an arbitrary protocol, for
computing the sum (or any other input-sensitive
function) of the local values residing in the
processors of some network with n processors
and m uni-directional links (which may fail-stop
in an undetectable manner). Then there exists an
execution of Π, on such a network, in which

\Omega\!\left(\frac{n \cdot m}{\log^{7}(n)}\right)

messages are sent.

We get the constant 7 in the above bound,
since the graph G contains O(n log2(n)) vertices
and O(m log2(n)) + O(n log2(n)) edges. For
m > n log2(n) we get an Ω(n·m/log^7(n)) bound
by a mere substitution. For m < n log2(n) (yet
m > (1+ε)·n for some constant ε > 0) a slightly
more careful argument yields the stated bound.
We believe that the constants (in the exponent of
the logarithmic factors) can be improved
significantly.


APPENDIX 

We present a graph G_k(V_k,E_k), which is a
sparse routing graph of size n = 2^k (for some k).
The graph presented here is based on the
balanced communication scheme presented in
[U]. Fig. A.1 shows the graph G_k for n = 8.

Fig. A.1 The graph G_k for n = 8

The graph G_k is composed of k+2 layers
of 2n vertices each. The vertices at every layer
are labeled with binary strings of length k+1.
Edges in G_k exist only between every two
adjacent layers. In particular, the following edges
exist between layer i and layer i+1 (where
1 ≤ i ≤ k+1):

•  From  every  vertex  in  layer  i  to  the  vertex 
with  the  same  label  in  layer  i  + 1 . 

•  From  every  vertex  in  layer  i  to  the  vertex 
(in  layer  i  +  l)  labeled  with  the  same 
binary  string  except  for  the  i-th  bit  which  is 
toggled. 

Note that in the above graph there are 2n
source vertices (in layer 1) and 2n target vertices
(in layer k+2). Therefore, in order to comply
with the definition of routing graph, we drop,
from G_k, all the source vertices in which the first
bit of their label is 1 (along with their outgoing
edges).

Note  that  from  every  source  of  the  sparse 
routing  graph  there  exists  a  unique  directed  path 
to  each  target  and  that  the  paths  from  each  source, 
to  all  targets,  form  a  directed  tree. 
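
The construction above can be checked mechanically for small k. The following is a minimal sketch, not the authors' code: the function names, the encoding of labels as integers, and the convention that the "i-th bit" is counted from the left are all assumptions made here for illustration.

```python
def build_sparse_routing_graph(k):
    """Return (vertices, edges) of G_k; a vertex is a pair (layer, label)."""
    n = 2 ** k
    bits = k + 1                      # labels are binary strings of length k+1
    labels = range(2 * n)             # 2n labels per layer, encoded as integers
    vertices = {(layer, x) for layer in range(1, k + 3) for x in labels}
    edges = set()
    for i in range(1, k + 2):         # edges between layer i and layer i+1
        mask = 1 << (bits - i)        # assumption: i-th bit counted from the left
        for x in labels:
            edges.add(((i, x), (i + 1, x)))          # same label
            edges.add(((i, x), (i + 1, x ^ mask)))   # i-th bit toggled
    # Drop the sources whose first bit is 1, together with their outgoing edges.
    first_bit = 1 << (bits - 1)
    dropped = {(1, x) for x in labels if x & first_bit}
    vertices -= dropped
    edges = {(u, v) for (u, v) in edges if u not in dropped}
    return vertices, edges


def count_paths(edges, source, target):
    """Count directed paths from source to target by exhaustive search."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)

    def walk(u):
        if u == target:
            return 1
        return sum(walk(v) for v in succ.get(u, []))

    return walk(source)
```

For k = 3 (n = 8) the sketch can be used to confirm that every remaining source has exactly one directed path to each of the 2n targets, and that the vertex and edge counts stay below 2n log2(4n) and 4n log2(4n).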

We denote the process in which each source
randomly selects (uniformly and independently)
one of the 2n possible targets by a random
selection.


Claim A.1: In a random selection on the graph
G_k(V_k,E_k), for each target the probability
to be chosen c (or more) times is bounded by 2^(-c).

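Proof of Claim A.1: An exact calculation of this
probability gives:

\sum_{j=c}^{n} \binom{n}{j} \left(\frac{1}{2n}\right)^{j} \left(1-\frac{1}{2n}\right)^{n-j}

Performing an indices transformation j → j + c
we get:

\sum_{j=0}^{n-c} \binom{n}{j+c} \left(\frac{1}{2n}\right)^{j+c} \left(1-\frac{1}{2n}\right)^{n-c-j}

Using a combinatorial inequality on the binomial
coefficients, we can bound our probability by a
quantity which is smaller than 2^(-c). □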

Claim A.2: In a random selection on the graph
G_k(V_k,E_k), for every vertex (not necessarily a
target vertex) the probability of participating in c
(or more) of the induced source-target paths is
bounded by 2^(-c).

Proof of Claim A.2: Consider a random selection
on the graph G_k. By dropping all the vertices in
the last l layers along with their incoming edges,
we get 2^l separated subgraphs (each identical to
G_{k−l}). The original random selection on the
graph G_k induces random selections on each one
of these subgraphs. By Claim A.1, we get that for
each one of the new targets, the probability to be
chosen c (or more) times is bounded by 2^(-c).
Therefore, in the original random selection, the
probability of its participating in c (or more) of
the induced source-target paths is also bounded
by 2^(-c). □

By noting that the graph G_k contains 2n log2(2n)
vertices which are not source vertices and that the
length of each unique path is bounded by k + 1,
we can derive from Claim A.2 the following:

Corollary A.1: In a random selection on the graph
G_k, the probability that there exists a vertex in the
graph which participates in more than 2log2(n)
of the induced selected paths is bounded by ½.

Corollary A.2: In a random selection on the graph
G_k, the probability that one of the induced
selected paths crosses more than 2log2(n) other
selected paths is bounded by ½.
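
A random selection and the resulting load on each vertex can be simulated directly. The sketch below is an illustration only: the names `path_vertices` and `random_selection_loads`, the integer encoding of labels, and the left-to-right bit numbering are assumptions made here, not part of the original construction.

```python
import random


def path_vertices(k, source, target):
    """Non-source vertices (layer, label) on the unique path source -> target."""
    bits = k + 1
    x = source
    path = []
    for i in range(1, k + 2):
        mask = 1 << (bits - i)        # assumption: i-th bit counted from the left
        if (x ^ target) & mask:       # toggle the i-th bit only where labels differ
            x ^= mask
        path.append((i + 1, x))
    return path


def random_selection_loads(k, rng):
    """Each source picks a uniform target; return how many paths use each vertex."""
    n = 2 ** k
    loads = {}
    for source in range(n):           # sources are the labels whose first bit is 0
        target = rng.randrange(2 * n)
        for v in path_vertices(k, source, target):
            loads[v] = loads.get(v, 0) + 1
    return loads
```

Running `random_selection_loads(k, random.Random(seed))` for several seeds and comparing `max(loads.values())` with 2log2(n) gives an empirical feel for Corollary A.1; it is of course no substitute for the proof.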

Note that the graph G_k contains less than
2n log2(4n) vertices and less than 4n log2(4n)
edges and therefore G_k is a sparse routing graph.

1 Abstract

Self-stabilizing systems have been proposed as a
desirable method of achieving fault tolerance. They are
guaranteed to eventually eliminate any initial set of
errors. This also implies that infrequent errors can
be dealt with. For more details see [1-7]. In this
paper we study deterministic self-stabilizing algorithms
for leader election in rings. We have two sets of
contributions. First, we introduce the formal definition
of an observer at each location: a local process that
can detect correctness, but cannot influence the
protocol. Every self-stabilizing algorithm can have such
associated observers. We believe that this is a good
abstraction. We also claim that some such notion is
necessary to make self-stabilizing protocols useful.

The notion of an observer suggests a natural
question to anyone familiar with the P vs. NP question:
are there situations in which it is easier to detect
stability than to achieve it?

We exhibit a somewhat contrived problem where
this is indeed the case. Our second contribution is
a careful study of deterministic leader election
algorithms on uniform rings of processors. We show that
the class of protocols based on the general ideas of
Burns and Pachl [2] requires Θ(n³) steps in the worst
case to become stable. We present a protocol for this
problem that detects stability in O(n²) steps, and
uses only five extra bits. Thus, for this class of
algorithms verification is easier than computation.

We also characterize exactly the memory
requirements of these algorithms. The algorithm in [2]
requires Ω(p²/ln p) states per processor for a ring of
size p. We give a combinatorial characterization of
the exact number of states needed for similar
protocols as Θ(pR(p)), where R(n) ≤ O(√n/ln n ln ln n) is
a Ramsey theoretic function.

2  Introduction 

Self-stabilizing protocols are an elegant
methodology to achieve fault tolerance. While there has
been a large number of recent papers on the topic,
there are several problems with the technique.
Self-stabilization requires that the system enter the
desired stable set of configurations, no matter what the
initial state of individual processors (and no matter
in which order enabled processors take steps). It
follows that although stability is eventually attained, a
processor cannot determine whether this has yet
occurred (since part of the incorrect initial configuration
could be the setting of the components of the state
of the processor that indicate whether the system is
stable).

As a concrete example, consider a token ring. Once
stability has been attained, there is a single token in
the ring, and it can be used for granting control to
the processor that has it. Before the ring is stable,
there may be several tokens that are used by the
self-stabilizing protocol. During this period, a
processor that has a token may not assume that it is
the unique enabled process. Since processors cannot
know whether the ring is stable, all actions must be
tentative, even when the ring is already in a
legitimate configuration.

We propose a new model, in which it is meaningful
to say that "a processor knows that the ring is stable".
Formally, this statement will mean the following:

the state set of each processor is of the
form States_SSP × Observe (SSP stands for
self-stabilizing protocol). There is a subset
Stable of Observe such that for a correct
protocol, if the ring stabilizes, eventually
every processor will be in a state in the set
States_SSP × Stable. Conversely, if a
processor is in a state belonging to this set, the ring
is stable (provided no new errors have
occurred).

In order for the definition to make sense, we
stipulate that

• There is a value Init ∈ Observe such that all
processors start in a state in States_SSP × Init.
The daemon cannot change the Observe
component of the state.

• The transition relation does not depend on the
Observe component. It is solely a function of
States_SSP.

This formalizes the notion that there is a
component of the processor state that is used solely for
observation. It cannot influence the protocol, but it is
not affected by the daemon, so it can always have a
meaning.
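
The two stipulations can be illustrated by a small sketch in which the protocol's transition never reads the observation component. All names below (`ProcState`, `step`, `toy_protocol`, `move_counter`) are hypothetical, the toy protocol is not a self-stabilizing one, and we assume for simplicity that the observer reads only its own Observe component together with the protocol states it can see.

```python
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class ProcState:
    ssp: Hashable       # component in States_SSP, used by the protocol
    observe: Hashable   # component in Observe, used only for observation


def step(left: ProcState, own: ProcState,
         protocol: Callable, observer: Callable) -> ProcState:
    """One move of a processor in a unidirectional ring."""
    new_ssp = protocol(left.ssp, own.ssp)   # never reads any .observe field
    new_obs = observer(own.observe, left.ssp, own.ssp, new_ssp)
    return ProcState(new_ssp, new_obs)


def toy_protocol(left_ssp, own_ssp):
    """Hypothetical rule used only for the demonstration: copy the left state."""
    return left_ssp


def move_counter(own_obs, left_ssp, own_ssp, new_ssp):
    """A naive observer: count the moves made by this processor."""
    return own_obs + 1


INIT = 0  # the unique initial value of the Observe component
```

Whatever the Observe components contain, `step` produces the same protocol state, which is the independence property required of the transition relation.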

We shall see that this can be implemented (in the
case of uniform rings) by very few extra bits, and
thus it might be of practical use. One might think of
this scheme as an outside observer that looks at the
processor and determines when the ring is stable. There
may be a delay between the time the ring actually
becomes stable, and the time the observer detects
it. Since the protocol cannot be influenced by the
observer, the observation routine runs independently
from the protocol; thus the assumption that the
observer starts in a ('correct') unique initial state is not
very restrictive.

Of course, the observer indicating the ring to be
stable means only that at some time in the past the
ring was stable. Provided that no new errors
occurred, the ring is currently stable.

There  is  a  wealth  of  interesting  questions  about  this 
model.  For  example 

•  How  easily  can  stability  be  detected? 

•  How  many  extra  states  are  needed? 

•  Can  this  be  made  useful? 

In the rest of this paper we deal with
deterministic leader election protocols in uniform (anonymous)
rings. As Dijkstra [4] pointed out, there is no
self-stabilizing solution for n identical processors in a ring,
if n is composite. However, if n is prime, Burns and
Pachl [2] exhibited self-stabilizing election protocols,
with as few as O(n²/ln n) states per processor.


We will call protocols based on the elegant ideas
of [2] BP type protocols. We show that a BP type
protocol is guaranteed to attain stability after 4n³
moves, and that there is an adversary strategy for
the  daemon  that  makes  the  protocol  run  ^(n  —  1)^ 
steps  before  stabilization. 

We then study stability detection in this model.
There is an obvious simple observer strategy: use a
counter that is initially 0, and is incremented at every
move. When its value reaches an appropriate number
(of O(n³)) we know that the ring is stable. This
strategy has many faults: it uses too many states, and it
is nonadaptive, in that it uses a large amount of time even
for a ring that is initially stable. What our bounds
show is that after O(n³) steps the ring becomes
stable. Since the ring is asynchronous it is not even clear
in  principle  how  many  steps  need  to  be  taken  before 
every  processor  takes  0(n®)  steps  (in  fact  O(n^)  suf¬ 
fice.)  Note  that  it  is  unclear  how  to  get  a  global  count 
-  the  counters  belong  to  the  observer,  so  they  cannot 
participate  in  the  protocol.  Thus,  we  cannot  trans¬ 
mit  individual  counters  (unless  we  were  willing  to  use 
ones  that  can  be  altered  by  the  adversary  -  in  which 
case  it  is  unclear  how  we  could  keep  them  correct.) 

We show that there is a much better observer
strategy by presenting a protocol which correctly detects
the stability of the ring after O(n²) moves from the time
the ring is stabilized. The observer uses only 5 bits.
Finally, we show that the O(n²) detection delay is
necessary.

These two results show that in this case checking
stability is easier than achieving it.

Finally,  we  characterize  the  number  of  states  per 
processor  that  are  necessary  for  a  correct  implemen¬ 
tation  of  a  BP  type  protocol.  Let  R{n)  be  an  in¬ 
teger  with  the  following  property;  it  is  possible  to 
color  all  integers  in  D  =  {l,2,  -  -,n  —  1}  with  R{n) 
colors,  such  that  for  any  subset  {ai,a2.  •.<*»}  of  T>, 
1  <  ifc  <  n,  if  all  Oi  have  same  color  then  ^  ”• 

There  is  a  BP  type  protocol  with  nR{n)  states.  In 
particular,  the  optimum  protocol  has  as  many  states 
as  the  smallest  possible  number  of  colors.  Determin¬ 
ing  this  number  is  an  interesting  open  problem  of 
combinatorics.  We  have  no  nontrivial  lower  bounds. 
We  are  able  to  produce  a  coloring  with  0(\/i5vfeTirn) 
colors.  This  yields  0{ny/^  minn )  states,  as  op¬ 
posed  to  0(n*/lnn)  in  [2]. 


3 BP type Deterministic Self-Stabilizing
Protocols for Asynchronous
Rings of Prime Size

We  use  the  formal  definitions  of  Burns  and  Pachl.  For 
simplicity,  we  deal  with  unidirectional  rings.  For  the 
more  general  definition,  refer  to  [2]  . 

An n-processor ring is a four-tuple, S =
(G, R, Γ, Δ) where G is a cycle, the processors are
denoted by 0, 1, ..., n − 1 in order around the cycle, and
R orders the processors so that the first (leftward)
neighbor of processor i is processor i − 1 (mod n),
and the second (rightward) neighbor is processor i + 1
(mod n). Γ = Σ_0 × ··· × Σ_{n−1} is a set of
configurations with finite sets Σ_0, ..., Σ_{n−1}, and Δ = δ_0, ..., δ_{n−1}
is a sequence of transition relations. We refer to Σ_i
and δ_i as the state set and transition relation,
respectively, of processor i, or P_i. Each δ_i is a relation from
Σ_{i−1} × Σ_i to Σ_i. Thus the ring is unidirectional.

Let γ = (a_0, a_1, ..., a_{n−1}) ∈ Γ be a configuration.
We denote by δ_i(a_{i−1}, a_i) the set {x | (a_{i−1}, a_i, x) ∈ δ_i}.
Then we define δ_i(γ) to be

{(a_0, a_1, ..., a_{i−1}, x, a_{i+1}, ..., a_{n−1}) | x ∈ δ_i(a_{i−1}, a_i)}.

Processor i is enabled at γ if δ_i(γ) is not empty. If
processor i is enabled at γ, and γ' ∈ δ_i(γ), we write
γ →_i γ' and say that γ →_i γ' is a step of P_i. The
notation γ → γ' means that γ →_i γ' for some i. A
computation of a system is a finite or infinite sequence
γ_0 γ_1 ··· such that γ_{j−1} → γ_j for all j. Thus, we
consider only the serial computations. Only one
processor takes a step at a time, and each step in the
computation depends on the configuration resulting
from the previous step in the sequence. If more than
one processor is enabled at a configuration γ, then
the central demon (scheduler) will select one of the
enabled processors to take a step.
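
The serial computation model just described can be sketched as a small simulator. This is an illustration under our own assumptions: the names `enabled_moves`, `run`, `toy_delta` and `lowest_index_demon` are hypothetical, a uniform transition relation is represented as a function returning the set of possible next states, and the toy relation is not self-stabilizing (it can deadlock).

```python
def enabled_moves(config, delta):
    """Map each enabled processor i to its set of possible next states."""
    moves = {}
    for i in range(len(config)):
        options = delta(config[i - 1], config[i])   # left neighbour is i-1 (mod n)
        if options:
            moves[i] = options
    return moves


def run(config, delta, demon, max_steps):
    """Serial computation: one processor, chosen by the demon, steps at a time."""
    trace = [tuple(config)]
    for _ in range(max_steps):
        moves = enabled_moves(trace[-1], delta)
        if not moves:
            break                                    # no processor is enabled
        i, new_state = demon(moves)
        nxt = list(trace[-1])
        nxt[i] = new_state
        trace.append(tuple(nxt))
    return trace


def toy_delta(left, own):
    """Hypothetical relation: a processor may copy a differing left neighbour."""
    return {left} if left != own else set()


def lowest_index_demon(moves):
    """One possible central demon: always pick the enabled processor of lowest index."""
    i = min(moves)
    return i, min(moves[i])
```

Replacing `lowest_index_demon` by another selection function models a different scheduler; worst-case bounds quantify over all such choices.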

Definition 1. A ring S = (G, R, Γ, Δ) is
self-stabilizing if and only if there is a set Λ ⊆ Γ, called the
legitimate configurations of S, such that the following
conditions are satisfied:

1. [No Deadlock] For every γ ∈ Γ there is a γ' ∈ Γ
such that γ → γ'.

2. [Closure] For every λ ∈ Λ, every λ' such that
λ → λ' is in Λ.

3. [No Livelock] Every infinite computation of S
contains a configuration in Λ.

4. [Mutual Exclusion] For every λ ∈ Λ, exactly one
processor is enabled.

5. [Fairness] For every processor i, every infinite
computation consisting of configurations in Λ
contains an infinite number of steps by P_i.

Because the ring is unidirectional, we write a \underline{b} → d
to express that a processor in state b (indicated by
underlining) is enabled to step to state d when it has a
processor in state a on the left. A protocol is specified
by such rules and by conditions under which they can
be applied (i.e., when the processor corresponding to
state b is enabled).

A ring is uniform if Σ_0 = Σ_1 = ··· = Σ_{n−1}, and
δ_0 = δ_1 = ··· = δ_{n−1}. Since a solution is impossible
in uniform rings of composite size, we assume n to be
a prime.

Let the size of the ring be n. The states of a
processor are composed of two parts, the label and the
tag, where labels range over L = {0, 1, ..., n − 2} and
tags over T = {0, 1, 2, ..., n}. We write label.tag to
denote the state with label label and tag tag.

The protocol is defined by the following two rules
in which the expressions a + 1 and b − a are computed
modulo n − 1.

Rule A. If b ≠ a + 1 and (b ≠ 0 or t = 0 or t ≠
f(b − a) or t < u), then

a.t \;\; \underline{b.u} \;\longrightarrow\; (a+1).f(b-a)

Rule B. If t ≠ u and a + 1 ≠ 0, then

a.t \;\; \underline{(a+1).u} \;\longrightarrow\; (a+1).t

Let Λ be the set of all cyclic permutations of
configurations of the following form (underlining indicates
the enabled state):

0.0 \;\; 1.0 \;\cdots\; (a-1).0 \;\; a.0 \;\; \underline{a.0} \;\; (a+1).0 \;\cdots\; (n-2).0
for a = 0, 1, ..., n − 2.

We say that the protocol is defined by function f.
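
To make the two rules concrete, here is a minimal executable sketch. It is a reading of the rules as given here, not code from [2]: the function names are ours, a state label.tag is represented as a pair (label, tag), and the guard of Rule B is taken to be t ≠ u, which is an assumption.

```python
def bp_move(left, own, n, f):
    """Return the new state of `own` if a rule is enabled for it, else None."""
    (a, t), (b, u) = left, own
    m = n - 1                          # a + 1 and b - a are computed modulo n - 1
    succ = (a + 1) % m
    if b != succ:                      # Rule A: the two processors form a gap
        tag = f((b - a) % m)
        if b != 0 or t == 0 or t != tag or t < u:
            return (succ, tag)
        return None
    if t != u and succ != 0:           # Rule B (guard assumed to be t != u)
        return (succ, t)
    return None


def legitimate_configurations(n):
    """All cyclic rotations of 0.0 1.0 ... a.0 a.0 (a+1).0 ... (n-2).0."""
    result = set()
    for a in range(n - 1):
        labels = list(range(a + 1)) + list(range(a, n - 1))
        base = [(label, 0) for label in labels]
        for r in range(n):
            result.add(tuple(base[r:] + base[:r]))
    return result


def successors(config, n, f):
    """All configurations reachable from `config` by one step of one processor."""
    out = []
    for i in range(n):
        new = bp_move(config[i - 1], config[i], n, f)
        if new is not None:
            out.append(config[:i] + (new,) + config[i + 1:])
    return out
```

With n = 5 and f the identity (so that f(0) = 0), one can check that every legitimate configuration has exactly one enabled processor and that its single successor is again legitimate, in line with the Closure and Mutual Exclusion conditions of Definition 1.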

Lemma 1 If f is any function such that f(k) = 0 iff
k = 0, then the protocol defined by f satisfies
conditions 2, 3, 4, and 5 of Definition 1. [2]

Thus, if a function f as above also satisfies
condition 1 [No Deadlock], then the protocol defined by f
is correct. We shall call such functions good.

4 Number of Steps Used by
BP Type Protocols

4.1 The Upper Bound

Theorem 1 For any central demon and for any
initial configuration, the system will enter a legitimate
configuration in at most 4n³ steps.

The proof of the theorem is by a careful analysis of
the kinds of configurations the ring can be in. More
precisely, we define the notions of gap and segment in
the sequence of contiguous processors that is a
configuration of the system. We then examine the action
of both Rule A and Rule B on these sequences.
Definition. Let P_i and P_{i+1} be two consecutive
processors with states a.t and b.u in a configuration γ.
If b ≠ a + 1 (mod n − 1), then P_i and P_{i+1} form a
gap of γ; the gap size (of P_i at γ) is defined to be
g(i,γ) = b − a (mod n − 1).

Definition. A segment of γ is a maximal
cyclically contiguous sequence of processors s =
(P_i, P_{i+1}, ..., P_{j−1}, P_j) which contains no gaps; the
gap size of s is g(s,γ) = g(j,γ). We will call P_i the head of s
and P_j the tail of s.

A segment has the form: a.* (a + 1).* ··· (a + k).*.
The head has label a, and the tail has label (a + k).
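
The two definitions translate directly into code. The sketch below is only an illustration: the function names are ours, and a configuration is assumed to be a sequence of (label, tag) pairs indexed by processor.

```python
def gaps(config, n):
    """Pairs (i, size) such that P_i and P_{i+1} form a gap of the given size."""
    m = n - 1                               # labels are compared modulo n - 1
    found = []
    for i in range(n):
        a = config[i][0]
        b = config[(i + 1) % n][0]
        if b != (a + 1) % m:
            found.append((i, (b - a) % m))
    return found


def segments(config, n):
    """Maximal gap-free runs, returned as (head index, tail index) pairs."""
    tails = [i for i, _ in gaps(config, n)]  # a segment ends where a gap begins
    if not tails:
        return []
    previous = [tails[-1]] + tails[:-1]
    return [((p + 1) % n, t) for p, t in zip(previous, tails)]
```

Since the labels advance modulo n − 1 while the ring has n processors, a configuration always contains at least one gap, so `segments` returns at least one segment for any n > 2.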

If  Rule  A  is  applied  at  the  head  of  a  segment  of 
length  at  least  2,  then  both  that  segment  and  the  one 
to  its  left  survive,  and  the  size  of  the  gap  between 
them  remains  the  same.  Neither  Rule  A  nor  Rule  B 
can  increase  the  number  of  segments. 

Note that Rule B is applied inside a segment and
it does not pass a tag across a state with label 0. By
changing the tail of a segment at most n − 1 times,
the tail will have state 0.f(g), where g is the gap size
between the segment and the one on its right.

Lemma 2 Assume there are k (> 1) segments at
configuration γ. Then after at most (n + m)k steps
of Rule A under any central demon, either there are
at most k − 1 segments left or all k segments change
their tails at least m times.

Proof. At γ, let the k segments be s_1, s_2, ..., s_k,
with lengths l_1, l_2, ..., l_k respectively. Consider
segment s_k. Define the segment length array at
configuration γ as

(l_k, l_{k-1}, \ldots, l_1)[\gamma]

If after p steps of Rule A, the configuration is γ_1 and
the tail of segment s_k is not changed, then it is not
difficult to see that:

p = \sum_{i=1}^{k} i\,(l_i - t_i)

where (t_k, t_{k−1}, ..., t_1)[γ_1] is the segment length array
at γ_1.

Assume segment s_k changed its tail at
configurations β_1, β_2, ..., β_m and then entered configurations
γ_1, γ_2, ..., γ_m respectively. Let the segment length
array at β_i, 1 ≤ i ≤ m, be

(t_{i,k}, t_{i,k-1}, \ldots, t_{i,1})[\beta_i]

then the segment length array at γ_i, 1 ≤ i ≤ m, is

(t_{i,k}+1, t_{i,k-1}, \ldots, t_{i,1}-1)[\gamma_i]

Therefore the total number of steps of Rule A is:

\begin{aligned}
& k(l_k - t_{1,k}) + (k-1)(l_{k-1} - t_{1,k-1}) + \cdots + (l_1 - t_{1,1}) + 1 \\
&\quad + k((t_{1,k}+1) - t_{2,k}) + (k-1)(t_{1,k-1} - t_{2,k-1}) + \cdots + ((t_{1,1}-1) - t_{2,1}) + 1 \\
&\quad + k((t_{2,k}+1) - t_{3,k}) + (k-1)(t_{2,k-1} - t_{3,k-1}) + \cdots + ((t_{2,1}-1) - t_{3,1}) + 1 \\
&\quad + \cdots \\
&\quad + k((t_{m-1,k}+1) - t_{m,k}) + (k-1)(t_{m-1,k-1} - t_{m,k-1}) + \cdots + ((t_{m-1,1}-1) - t_{m,1}) + 1 \\
&= (m-1)k + 1 + \sum_{i=1}^{k} i\,(l_i - t_{m,i}) \\
&\le (m-1)k + 1 + \sum_{i=1}^{k} i\,l_i \\
&\le (m-1)k + 1 + k\sum_{i=1}^{k} l_i \\
&\le (m+n)k
\end{aligned}

□

A segment $s$ of $\gamma$ is well formed if all its tags are
equal to $f(g(s,\gamma))$.

Lemma 3 If there are $k$ ($> 1$) segments at configuration
$\gamma$, then after at most $3kn$ steps of Rule A under
any central demon, either there are at most $k - 1$
segments left or all $k$ segments are well formed.

Proof. At $\gamma$, let the $k$ segments be $s_1, s_2, \ldots, s_k$,
with lengths $l_1, l_2, \ldots, l_k$ respectively. Let us consider
segment $s_k$. Assume $s_k$ changed its tail $m$ times and
has $0.f(g)$ ($g$ is its gap size) as its tail at configuration
$\beta$ with segment length array $(t_k, t_{k-1}, \ldots, t_1)[\beta]$. By
the proof of Lemma 2, the protocol takes

$$(m-1)k + 1 + \sum_{i=1}^{k} i\,(l_i - t_i)$$

steps of Rule A. After configuration $\beta$, any new tail
of segment $s_k$ will have this same tag $f(g)$, since Rule
B doesn't pass a tag across label 0. From $\beta$, after
segment $s_{k-1}$ changes its tail at most $t_k$ times, segment
$s_k$ will have $0.f(g)$ as its head. So the segment $s_k$ is
well formed this time, say, at configuration $\alpha$. By
Lemma 2, the total number of steps of Rule A is then

$$\le (m-1)k + 1 + \sum_{i=1}^{k} i\,(l_i - t_i) + (n + t_k)k \le 3kn.$$


Lemma 4 If there are $1 < k < n$ segments at configuration
$\gamma$, and all $k$ segments are well formed, then
after at most $2kn$ applications of Rule A, there will
be at most $k - 1$ segments left.

Proof. At $\gamma$, let the $k$ segments be $s_1, s_2, \ldots, s_k$,
and let the gap sizes be $g_1, g_2, \ldots, g_k$. If all $f(g_i)$
are zero, then $\gamma$ is already a legitimate configuration.
WLOG, assume $f(g_1) \geq f(g_2)$ and $f(g_1) \neq 0$. If the
tail of segment $s_1$ changed $m$ times, then the head of
segment $s_2$ changed $m$ times as well. But if the head
of segment $s_2$ is $0.f(g_2)$, it can't be changed again
unless the number of segments decreases. Hence, the
tail of segment $s_1$ can be changed at most $n$ times.
By Lemma 2, in at most $(n + n)k$ applications of
Rule A there will be at most $k - 1$ segments left. ■

The following lemma provides the main tool to
bound the number of applications of Rule A.

Lemma 5 After at most $\frac{5}{2}n^3$ applications of Rule A,
the system will be in a legitimate configuration.

Proof. The first application of Rule A will enter a
configuration which has at most $n - 1$ segments. By
Lemmas 3 and 4, the number of segments will decrease
from $k$ ($< n$) in at most $5kn$ steps of Rule A.
So the number of applications of Rule A needed before
entering a one-segment configuration is at most

$$1 + \sum_{k=1}^{n-1} 5kn = 1 + \frac{5}{2}n^2(n-1).$$

In at most $n$ more applications of Rule A, the
configuration will be legitimate. ■
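The arithmetic behind this count can be checked mechanically. The sketch below assumes the reading of the proof in which one initial step is followed by at most $5kn$ steps per merge for $k = 1, \ldots, n-1$ and then at most $n$ further steps; the comparison against $\frac{5}{2}n^3$ is likewise an assumption about the garbled constant in the lemma statement. The function name `rule_a_steps` is illustrative.

```python
def rule_a_steps(n):
    """Worst-case count of Rule A steps assembled as in the proof of Lemma 5."""
    # One initial step, then at most 5kn steps to go from k segments to k - 1.
    to_one_segment = 1 + sum(5 * k * n for k in range(1, n))
    # At most n further steps once a single segment remains.
    return to_one_segment + n


for n in range(2, 50):
    # Closed form of the sum: 5n * n(n-1)/2, which is always an integer.
    closed_form = 1 + 5 * n * n * (n - 1) // 2 + n
    assert rule_a_steps(n) == closed_form
    # The total stays below (5/2) n^3; compared here as 2 * total <= 5 n^3.
    assert 2 * rule_a_steps(n) <= 5 * n ** 3
```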

Now  we  look  at  Rule  B. 

Lemma 6 If a segment $s$ has length $l$, then Rule B
can be applied at most $l(l-1)/2$ times before it changes
its tail, and Rule B can only be applied at most $l$ times
to any new tail of $s$ before the number of segments
decreases.


Proof. Let the segment be $a.\ast\ (a+1).\ast\ \cdots\ (a+l-1).\ast$.
Then label $(a + i)$ can apply Rule B at most
$i$ times; the first part of the lemma follows.

When the tail changes to $a + l$, it will have tag $f(g)$
($g$ is its gap size) initially. This tag can be changed
at most $l$ times. Any new tail after $a + l$ will have
the same tag $f(g)$ when it joins segment $s$. Thus the
number of times a new tail can apply Rule B is the
same as the number of times Rule B is applied at $a + l$,
which is at most $l$. ■

Lemma 7 If there are $1 < k < n$ segments at configuration
$\gamma$, then after at most $\frac{3}{2}n(n-1)$ applications
of Rule B, there will be at most $k - 1$ segments left.

Proof. At $\gamma$, let the $k$ segments be $s_1, s_2, \ldots, s_k$,
with lengths $l_1, l_2, \ldots, l_k$ respectively. For every
$i$, Rule B can only apply to at most $n - 1$ new tails
of $s_i$, because Rule B doesn't pass a tag across label
0. Therefore, by Lemma 6, $k$ will be decreased in at
most

$$\sum_{i=1}^{k} \frac{l_i(l_i - 1)}{2} + \sum_{i=1}^{k} (n-1)\,l_i \le \frac{3}{2}n(n-1)$$

applications of Rule B, since $\sum_{i=1}^{k} l_i = n$. ■

The  following  lemma  concludes  the  proof  for  Rule 
B. 

Lemma 8 After at most $\frac{3}{2}n^3$ applications of Rule B,
the configuration will become legitimate.

Proof. Apply Lemma 7 at most $n - 1$ times;
the configuration will then have only one segment. By
Lemma 6, a one-segment configuration can apply Rule
B at most $n(n-1)/2$ times. Thus the total number of
times a step of Rule B can be taken before entering
a legitimate configuration is at most

$$\frac{3}{2}n(n-1)^2 + \frac{1}{2}n(n-1) < \frac{3}{2}n^3.$$

■

The two main lemmas clearly imply Theorem 1.

4.2  The  Lower  Bound 

Theorem  2  For  any  correct  BP  type  protocol  P, 
there  is  a  demon  strategy  and  an  initial  configuration, 
such  that  P  takes  at  least  i(n  —  1)^  steps  to  enter  a 
legitimate  configuration. 

The  proof,  is  by  constructing  a  strategy  for  the 
demon  and  an  initial  configuration  which  forces  the 
protocol  to  run  at  least  5(11  —  1)^  steps  before  stabi¬ 
lization. 
Let us consider the initial configuration

0.0  0.0  ■  •  ■  M 

The processor with an underlined state is the processor
chosen by the demon to take a move. A sequence
of steps is as follows:


0.0  0.0 

---  O.OMI  O 

[71] 

0.0  0.0 

-  -  -  Ml-0  1.0 

0.0  1.0 

--  -  1.01.0 

2./(n- 

2)  1.0  -  --  1.01.0 

by  Rule  B 

2.0  1.0 

---  l.OU 

[72] 

For  convenience,  we  rewrite  configuration  72  as 
1.0  1.0  •  ••  1.0  LO  2.0  [72] 

From $\gamma_1$ to $\gamma_2$, the system takes $n - 1$ steps of Rule
A and one step of Rule B. Similarly, the following
configurations will be reached:

2.0  2.0  2.0  2.0  3.0 


(n  -  3).0  (n  -  3).0  ■••(«-  3).0  -  3).0  (n  -  2),0 

(n  -  2).0  (n  -  2).0  •  •  •  (n  -  2).0  (n  -  2).0  0./(n  -  2) 

Again, from $\gamma_i$ to $\gamma_{i+1}$, $i = 2, 3, \ldots, n - 2$, the system
takes $n - 1$ steps of Rule A. So far we used $(n - 1)(n - 1)$
steps of Rule A. After $n - 1$ more steps of Rule A,
the system will enter a configuration $\gamma_n$ in which all
segments are well formed.


Let configuration $\beta_i$, $i = 2, 3, \ldots, n - 2$, be

$0.0 \quad \cdots \quad 0.0 \quad 1.0 \quad 2.0 \quad \cdots \quad i.0 \qquad [\beta_i]$

We know that from $\beta_i$ to $\beta_{i+1}$, the system takes at least
$(n - i)(n - 1)$ steps of Rule A. Therefore the system
takes at least

^(n  -  i)(n  -  1)  >  hn  -  1)® 
i=3  ^ 

steps  of  Rule  A  to  enter  a  legitimate  configuration. 

5  Stability  Detection  Protocol 

5.1  The  Protocol 

If the ring is stable, the state of any fixed processor
increases by one (mod n) when the processor is
enabled next time. If this happens at a processor n
consecutive times, its observer should be able to detect
the stability. This is the idea of our detection
protocol.

We present a protocol for observers to detect the
stability of the ring. It uses 5 extra bits: two bits
$b_1, b_2$ that are determined by the last move of the
processor, and a three-bit counter $C$.

The protocol is:

The observer $O$ initializes its counter $C$ to 0. If
its processor $P$ takes a step of Rule A of the form
$a.t\ \ b.u \to (a + 1).f(b - a)$, $P$ sets

$$b_1 = \begin{cases} 1 & \text{if } b = a \\ 0 & \text{otherwise} \end{cases}$$


0.0  0.0  •  •  •  0.0  0./(rt  -  2)  l./(n  -  2) 
n  another  two  steps  of  Rule  A: 

0.0  0.0  •  ■  •  0.0  1.0  l./(n  -  2) 

0.00.0  00  1-02.0  [P2] 

3y  taking  (n  —  2)(n  —  3)  steps,  the  system  reaches 
:onfiguration 

'n  -  3).0  -  -  -  (n  -  3).0  (n  -  3).0  (n  -  2).0  0./(n  -  3) 

!(n  —  2)  more  steps  will  let  system  enter  a  strongly 
veil  formed  configuration 


1  if  a -f  1  =  0  (mod 
0  otherwise 

When $b_1, b_2$ change, the observer $O$ updates its
counter $C$ according to the following rules:

Sets $C = 0$ if $b_1 = 0$
Does nothing if $(b_1 = 1) \wedge ((b_2 = 0 \wedge C < 3) \vee C = 5)$
Increases $C$ by 1 if $(b_1 = 1) \wedge ((b_2 = 1 \wedge C < 3) \vee 3 \le C < 5)$

The observer knows that the ring is stable when
$C = 5$.
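The three update rules can be written out as a small state machine. This is only a sketch: it takes the bits $b_1$ and $b_2$ as given inputs, because the exact condition defining $b_2$ is truncated in the text above, and it reads the last rule as covering $C \in \{3, 4\}$, which is the only case the other two rules leave open. The names `update_counter` and `run` are illustrative.

```python
def update_counter(c, b1, b2):
    """One observer update; returns the new value of the counter C."""
    if b1 == 0:
        return 0            # the last move was not a valid step: start over
    if c < 3:
        return c + 1 if b2 == 1 else c
    if c < 5:
        return c + 1        # C is 3 or 4: every step with b1 = 1 counts
    return 5                # C = 5: stability already detected


def run(bits, c=0):
    """Feed a sequence of (b1, b2) pairs to the observer, starting from C = c."""
    for b1, b2 in bits:
        c = update_counter(c, b1, b2)
    return c


# C moves 1, 1, 2, 3, 4, 5 on this input.
print(run([(1, 1), (1, 0), (1, 1), (1, 1), (1, 0), (1, 0)]))
# A single move with b1 = 0 resets the counter.
print(run([(1, 1), (1, 1), (0, 0)]))
```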

5.2  The  Correctness 


0.0  -  -  -  0.0  0./(n  -  3)  l./(n  -  3)  2./(n  -  3) 

1  more  steps,  we  have  a  configuration  which  has  n  -  4 
egments  of  length  1  and  one  segment  of  length  4. 

0.0  -  -  -  (LQ  1-02.0  3.0  [Pz] 


The main ideas are the following: $C = 3$ when the
ring consists of a single segment. When $C$ becomes 4,
state 0 has a tag of 0. The next time $C$ is incremented,
all tags are 0, and the ring is stable. The main difficulty
of the proof is showing that the counter behaves
appropriately as the ring stabilizes. This is done by
proving by induction on the number of segments that
a program invariant (a modular equation about the
number of segment merges) holds.

Let $\alpha$ be some configuration with $k$ segments
$s_1, s_2, \ldots, s_k$ clockwise arranged on the ring. Suppose
that $\gamma, \gamma'$ are two consecutive configurations
not before $\alpha$. If two adjacent segments $s_i(\gamma), s_j(\gamma)$
merge into one segment at configuration $\gamma'$, we call
the merged segment at configuration $\gamma'$ $s_i(\gamma')$, where
$i < j$. If $g_i(\gamma), g_j(\gamma)$ are the gaps of $s_i(\gamma), s_j(\gamma)$
respectively, then the merged gap is

$$g_i(\gamma') = g_i(\gamma) + g_j(\gamma) - 1 \pmod{n-1}.$$

If the step $\gamma \to \gamma'$ is not a merging step, then

$$s_i(\gamma') = s_i(\gamma), \quad g_i(\gamma') = g_i(\gamma) \quad \forall i.$$

In the following definitions, $P$ is a fixed processor.

Definition. A valid step of $P$ is a step of Rule A
taken by $P$ and of the form

$a.t\ \ a.u \to (a + 1).0$. ■


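Definition. $M(\alpha, \gamma, s_i(\gamma))$ is the number of merges
associated with segment $s_i(\gamma)$ between configurations
$\alpha$ and $\gamma$ with respect to processor $P$; its value is defined
by the following rules:

1. $M(\alpha, \alpha, s_i(\alpha)) = 0$ for $i = 1, 2, \ldots, k$.

2. Let $\gamma, \gamma'$ be two consecutive configurations not
before $\alpha$. Assume that $P$ is in segment $s_l(\gamma')$ at
configuration $\gamma'$.

If $l = 1$ then $M(\alpha, \gamma', s_i(\gamma')) = 0$ for all $i$.

If the step $\gamma \to \gamma'$ is not a merging step, then

$M(\alpha, \gamma', s_i(\gamma')) = M(\alpha, \gamma, s_i(\gamma))$ for all existing $i$.

If the step $\gamma \to \gamma'$ merges two segments $s_i(\gamma)$
and $s_j(\gamma)$ to $s_i(\gamma')$ ($i < j$), then

$M(\alpha, \gamma', s_h(\gamma')) = M(\alpha, \gamma, s_h(\gamma))$ for $h \neq i, j$,

and

$$M(\alpha, \gamma', s_i(\gamma')) = \begin{cases} M(\alpha, \gamma, s_j(\gamma)) & \text{if } i < l \\ M(\alpha, \gamma, s_i(\gamma)) + M(\alpha, \gamma, s_j(\gamma)) + 1 & \text{if } i \geq l \end{cases}$$

For convenience, we also define $\gamma' = \gamma + 1$.

By the above definition, we know $M(\alpha, \gamma, s_i(\gamma)) = 0$
for $i < l$.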

Lemma 9 If there are $k$ gaps $g_1, g_2, \ldots, g_k$ of some
configuration, then $\sum_{i=1}^{k} g_i = k - 1 \pmod{n-1}$.

Proof. WLOG assume that the $k$ segments are clockwise
ordered, and their lengths are $l_1, l_2, \ldots, l_k$
respectively. Let label $a$ be the head of the first segment;
then the head of the last segment has label

$$b = a + \sum_{i=1}^{k-1} (l_i + g_i - 1) \pmod{n-1}$$

and since $b + l_k + g_k - 1 = a \pmod{n-1}$, we have
$\sum_{i=1}^{k} (l_i + g_i - 1) = 0 \pmod{n-1}$. Therefore

$$\sum_{i=1}^{k} g_i = -\sum_{i=1}^{k} l_i + k = -n + k = k - 1 \pmod{n-1}.$$
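The congruence in Lemma 9 is easy to test numerically. The sketch below assumes the gap definition used earlier in the paper: adjacent labels $a, b$ form a gap when $b - a \neq 1 \pmod{n-1}$, and the gap size is $(b - a) \bmod (n-1)$. The helper names `gap_sizes` and `lemma9_holds` are illustrative, and the random configurations ignore tags.

```python
import random


def gap_sizes(labels):
    """Gap sizes of a ring of labels taken modulo n - 1 (n = len(labels) >= 3)."""
    n = len(labels)
    mod = n - 1
    diffs = [(labels[(i + 1) % n] - labels[i]) % mod for i in range(n)]
    return [d for d in diffs if d != 1]   # a difference of 1 means "no gap"


def lemma9_holds(labels):
    """Check that the gap sizes sum to k - 1 modulo n - 1, with k the number of gaps."""
    mod = len(labels) - 1
    gaps = gap_sizes(labels)
    return sum(gaps) % mod == (len(gaps) - 1) % mod


random.seed(0)
checks = [
    lemma9_holds([random.randrange(n - 1) for _ in range(n)])
    for n in range(3, 12)
    for _ in range(200)
]
print(all(checks))
```

The check succeeds for every label assignment because the $n$ cyclic differences sum to 0 modulo $n-1$ and each of the $n - k$ non-gap pairs contributes exactly 1.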


Lemma 10 Suppose that processor $P$ takes a sequence
of $n$ consecutive valid steps at configurations
$\gamma_0, \gamma_1, \ldots, \gamma_{n-1}$; then the ring consists of one segment
at configuration $\gamma_{n-1}$.

Proof. WLOG assume that at $\gamma_1$, there are $k$ segments
$s_1(\gamma_1), s_2(\gamma_1), \ldots, s_k(\gamma_1)$ clockwise arranged
on the ring with $P$ as the head of $s_1(\gamma_1)$, and the $k$
gaps are $g_1(\gamma_1), g_2(\gamma_1), \ldots, g_k(\gamma_1)$ respectively.
Because there are at most $n - 1$ segments left after the
first valid step of $P$, we know that $k \le n - 1$. When
$P$ takes a valid step, it joins the next segment unless
this valid step is a merging step. So there exists
a configuration $\gamma_l$ ($l \le n - 1$) such that $P$ is about to
join segment $s_1(\gamma_l + 1)$ for the first time after $P$ left
segment $s_1(\gamma_1)$.

Claim. $\forall\, \gamma_1 < \gamma \le \gamma_l$, if $P$ is in segment $s_h(\gamma)$ at
configuration $\gamma$ and $h \neq 1$, then for all $j \geq h$

(*) $g_j(\gamma) + M(\gamma_1, \gamma, s_j(\gamma)) = 0 \pmod{n-1}$

Proof of Claim (by induction on $t = \gamma - \gamma_1$)

Base Case: $t = 1$

$\gamma_1 \to \gamma$ is a valid step of $P$. If $k \geq 2$ then $h = k$; it
implies that $\gamma_1 \to \gamma$ is not a merging step. Hence

$$g_k(\gamma) + M(\gamma_1, \gamma, s_k(\gamma)) = g_k(\gamma_1) + M(\gamma_1, \gamma_1, s_k(\gamma_1)) = 0$$

Inductive Hypothesis: Assume that (*) holds for $\gamma =
\gamma_1 + t$ and $P$ is in segment $s_h(\gamma)$.

Let $P$ be in segment $s_{h'}(\gamma + 1)$; we are going to
show that (*) holds for $\gamma + 1$ and $h'$. We have four
cases.

Case 1 ($\gamma \to \gamma + 1$ is not a merging step) and ($h' = h$)
Nothing changes; (*) still holds.

Case 2 ($\gamma \to \gamma + 1$ is not a merging step) and ($h' < h$)
Because (*) holds for $i \geq h$, we only need to show that
(*) holds for $i = h'$ too.

Since $h' < h$, $\gamma \to \gamma + 1$ is a valid step of $P$. Thus
$g_{h'}(\gamma + 1) = g_{h'}(\gamma) = 0$ and

$$M(\gamma_1, \gamma + 1, s_{h'}(\gamma + 1)) = M(\gamma_1, \gamma, s_{h'}(\gamma)) = 0$$

Case 3 ($\gamma \to \gamma + 1$ is a merging step) and ($h' = h$)
Assume two segments $s_i(\gamma), s_j(\gamma)$ ($i < j$) are merged
by this step. If $i \geq h$ then we know by inductive
hypothesis that:

$$g_i(\gamma) + M(\gamma_1, \gamma, s_i(\gamma)) = 0 \pmod{n-1}$$

$$g_j(\gamma) + M(\gamma_1, \gamma, s_j(\gamma)) = 0 \pmod{n-1}$$

Therefore

$$
\begin{aligned}
& g_i(\gamma + 1) + M(\gamma_1, \gamma + 1, s_i(\gamma + 1)) \\
&= g_i(\gamma) + g_j(\gamma) - 1 + M(\gamma_1, \gamma, s_i(\gamma)) + M(\gamma_1, \gamma, s_j(\gamma)) + 1 \\
&= 0 \pmod{n-1}
\end{aligned}
$$

Case 4 ($\gamma \to \gamma + 1$ is a merging step) and ($h' < h$)

$\gamma \to \gamma + 1$ is a valid step of $P$, and $s_{h'}(\gamma), s_h(\gamma)$
are merged by this step. So (*) holds for
$i \geq h$ by inductive hypothesis. $g_{h'}(\gamma) = 0$ and
$M(\gamma_1, \gamma, s_{h'}(\gamma)) = 0$ give us

$$
\begin{aligned}
& g_{h'}(\gamma + 1) + M(\gamma_1, \gamma + 1, s_{h'}(\gamma + 1)) \\
&= g_{h'}(\gamma) + g_h(\gamma) - 1 + M(\gamma_1, \gamma, s_{h'}(\gamma)) + M(\gamma_1, \gamma, s_h(\gamma)) + 1 \\
&= g_h(\gamma) + M(\gamma_1, \gamma, s_h(\gamma)) \\
&= 0 \pmod{n-1}
\end{aligned}
$$

End proof of Claim ■

We now prove the lemma by contradiction. Assume
that there are $r \geq 2$ segments at configuration
$\gamma_l$, and $P$ is in segment $s_h(\gamma_l)$. By the definition
of $l$, we know that segments $s_h(\gamma_l)$ and $s_1(\gamma_l)$ are
adjacent. And because $P$ is about to join segment
$s_1(\gamma_l + 1)$, $g_1(\gamma_l) = 0$. Recall that $M(\gamma_1, \gamma_l, s_1(\gamma_l)) = 0$.
Applying the Claim to $\gamma_l$, we obtain

$$g_j(\gamma_l) + M(\gamma_1, \gamma_l, s_j(\gamma_l)) = 0 \pmod{n-1} \text{ for all } j$$

$$\Rightarrow \sum_j g_j(\gamma_l) + \sum_j M(\gamma_1, \gamma_l, s_j(\gamma_l)) = 0 \pmod{n-1}$$

By Lemma 9, $\sum_j g_j(\gamma_l) = r - 1 \pmod{n-1}$.

$$\Rightarrow \sum_j M(\gamma_1, \gamma_l, s_j(\gamma_l)) = n - r \pmod{n-1}$$

This implies that

$$\text{Total number of merges between } \gamma_1 \text{ and } \gamma_l \;\geq\; \sum_j M(\gamma_1, \gamma_l, s_j(\gamma_l)) \;\geq\; n - r$$

Therefore, there are at most $k - (n - r)$ segments left
at configuration $\gamma_l$. But $k - (n - r) < r$, a contradiction. ■

Theorem 3 The protocol correctly detects the stability
of the ring.

Proof. It is easy to see that once the system is
stabilized, all counters will eventually reach the value
3. We are going to show that if a counter $C$ has value
5, then the system is stable.

Assume that $P$ has taken $m$ consecutive steps of
Rule A of the form $a.t\ \ a.u \to (a + 1).0$ when counter
$C$ reaches value 3, and that the $i$-th step has the form
$a_i.t\ \ a_i.u \to (a_i + 1).0$, for $i = 1, 2, \ldots, m$. We then
have $a_{i+1} = a_i + 1 \pmod{n-1}$ and the sequence
ni  I  <*2.  •  •  •  >  “m  contains  either  two  Os  and  one  or 
one  0  and  two  This  implies  that  m  >  n.  By 

Lemma  10,  the  ring  has  only  one  segment  in  it  now. 
Because  Rule  B  does  not  pass  a  tag  across  a  state 
with  label  0,  we  know  that  the  state  0  has  tag  0  when 
counter  C  has  value  4.  When  C  =  5,  all  tags  are  0, 
hence  the  ring  is  stable.  ■ 

6  The  Delay  of  Detection 

There is no protocol which detects the stability as
soon as the ring becomes stable. We study the amount
of delay in this section. Note that we are interested
in the delay after the ring becomes stable, so even
though the ring is asynchronous, it makes sense to
talk about cycles, since a single token is passed along
the ring, and these are the only enabled transitions.

At  a  legitimate  configuration,  a  processor  P  can 
have  state  0  and  its  observer’s  counter  C  can  have 
the  value  zero.  P’s  state  increases  by  one  each  cycle, 
counter  C  reaches  3  after  |(n  —  1)  —  1  cycles.  C  will 
be  5  after  two  more  cycles.  Therefore,  we  have 

Theorem  4  The  protocol  of  the  previous  section  has 
a  delay  of  cycles. 

We can reduce the delay to $n + 1$ cycles by a modified
protocol that uses a $\lceil \lg(n + 2) \rceil$ bit counter instead
of a five bit one. The observer $O$ initializes its counter
$C$ to 0. If its processor $P$ takes a step of Rule A of
the form $a.t\ \ b.u \to (a + 1).f(b - a)$, then $P$ sets

$$b_1 = \begin{cases} 1 & \text{if } b = a \\ 0 & \text{otherwise} \end{cases}$$

and sends $b_1$ to its observer $O$. After receiving $b_1$
from $P$, $O$ increases its counter by one unless $C$ has
reached value $n + 2$. The observer detects the stability
when $C = n + 2$. Comparing this modified protocol with
the original protocol, we see that there is a trade-off
between the delay and the space used by observers.
This delay is unavoidable, as shown by the theorem
below.
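A minimal sketch of the modified observer follows. The text only says that the counter is increased on receiving $b_1$ and saturates at $n + 2$; the reset to 0 when $b_1 = 0$ is an assumption carried over from the five-bit protocol, and the names `modified_update` and `detects` are illustrative.

```python
def modified_update(c, b1, n):
    """One update of the modified observer for a ring of n processors."""
    if b1 == 0:
        return 0                 # assumption: reset as in the five-bit protocol
    return min(c + 1, n + 2)     # count valid steps, saturating at n + 2


def detects(c, n):
    """The observer reports stability once the counter has reached n + 2."""
    return c == n + 2


n = 4
c = 0
for _ in range(n + 2):           # n + 2 consecutive steps with b1 = 1
    c = modified_update(c, 1, n)
print(detects(c, n))
```

The counter needs to represent the values $0, \ldots, n + 2$, which is where the logarithmic counter size comes from, in exchange for a shorter detection delay.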

Theorem 5 If the state of the observers is influenced
only by the state of the associated processor and by the
messages that pass through it, then any protocol that
detects stability has a delay of at least $n$ cycles ($n^2$
steps).

Proof. Let us consider the initial configuration $\gamma_0$

0.0  0.0  •  ■  ■  M 

The processor with an underlined state is the processor
chosen by the demon to take a move and the
rightmost processor is $P$. Let $O$ be $P$'s observer. A
sequence of steps is as follows:

0.0  0.0  O.OOJ  10 
0.0  0.0  •  •  •  Ml-O  1.0 

MIO  •••  1.01.0 

2./(n-2)  1.0  •  •  •  1.0  1.0  by  Rule  B 
2.0  1.0  •  •  ■  1.0  TO  [71] 

From $\gamma_0$ to $\gamma_1$, every processor takes a step of Rule A;
that is a cycle. $O$ has information $(0.0, 1.0)[\gamma_0 \to \gamma_1]$
at configuration $\gamma_1$. Similarly, the following configurations
$\gamma_2, \ldots, \gamma_{n-3}, \gamma_{n-2}, \gamma_{n-1}$ will be reached:

3.0  3.0  2.0  •••  2.02Ji 

(n  -  2).0  •  •  •  (n  -  2).0  (n  -  3).0  (n  -  3).0  (n  -  3).0 
0.0  0.0  ■  0.0  0./(n  -  2)  (n  -  2).0  (n  -  2).0 

I.OLQ  •••  1.0  l./(n-2)0.0 

Again, from $\gamma_i$ to $\gamma_{i+1}$, $i = 1, 2, \ldots, n - 2$, every processor
takes a step of Rule A. So the system takes
$n - 1$ cycles from $\gamma_0$ to $\gamma_{n-1}$. The system then takes
moves by the following sequence

1.0  2.0  LQ  •••  1.0  1./(t»  -  2)  0.0 

1.0  2.0  (n-2).0  l./(n-2)0.0 

1.0  2.0  ■  (n-2).0  0./(2)M 

M2.0  ■  ■  ■  (n-2).0  0.f(2)1.0  [7„] 

The observer $O$ has information

(**) $(0.0, 1.0, 2.0, \ldots, (n - 1).0, 0.0, 1.0)[\gamma_0 \to \gamma_n]$

at configuration $\gamma_n$.

Consider another initial configuration $\alpha_0$

$1.0 \quad 2.0 \quad \cdots \quad (n-2).0 \quad 0.0 \quad 0.0$

$\alpha_0$ is a legitimate configuration. Again, we assume
that $P$ is the rightmost processor. The following
configurations will be reached from $\alpha_0$.


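$2.0 \quad 3.0 \quad \cdots \quad (n-2).0 \quad 0.0 \quad 1.0 \quad 1.0 \qquad [\alpha_1]$

$3.0 \quad 4.0 \quad \cdots \quad 0.0 \quad 1.0 \quad 2.0 \quad 2.0 \qquad [\alpha_2]$

$\vdots$

$1.0 \quad 2.0 \quad \cdots \quad (n-2).0 \quad 0.0 \quad 0.0 \qquad [\alpha_{n-1}]$

$2.0 \quad 3.0 \quad \cdots \quad (n-2).0 \quad 0.0 \quad 1.0 \quad 1.0 \qquad [\alpha_n]$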

At configuration $\alpha_n$, $P$'s observer $O$ has information

$(0.0, 1.0, 2.0, \ldots, (n - 1).0, 0.0, 1.0)[\alpha_0 \to \alpha_n]$

This information is exactly the same as (**), thus $O$
cannot make the detection at configuration $\alpha_n$. Because
the system takes $n$ cycles from $\alpha_0$ to $\alpha_n$, any
detecting protocol must have a delay of at least $n$
cycles. ■

There is a one-cycle difference in delay between the
modified protocol and Theorem 5. This is because
we did not use the information about tags in our protocol.
If detecting protocols only use information about
labels, then Theorem 5 holds for $n + 1$ cycles. To
see this, note that the following configurations can be
reached from $\gamma_n$

1.0  2.0  •  ••  (n  -  2).0  0./(2)  LO 

[7„]  By  Rule  B 

L0  2.0  •••  (n  -  2).0  0./(2)  l./(2) 

2./(2)  3./(2)  •  •  •  (n  -  2)./(2)  0.0  1.0  l./(2) 
2./(2)  3./(2)  .  ■  •  (n  -  2)./(2)  0.0  1.0  2.0 

7 The Number of States Required by BP Type Protocols

As mentioned in [2], Seger has shown that any correct
uniform unidirectional protocol for a ring of $n$
processors must use at least $n - 1$ states per processor.
There is still a gap between the lower bound and
the upper bound on the number of states per processor.

It is clear that a BP type protocol with a good
function $f$ will require $(n - 1)|f(T)|$ states in each
processor.

Two functions are given and proven to be good in
[1]. One is $f(k) = k$, which gives a protocol with
$(n - 1)(n - 2)$ states. The other function is $f(k) =$
the smallest prime divisor of $n - k$; it gives a protocol
with $(n - 1)\frac{n}{\ln n}$ states by the prime number theorem.

Our goal is to find a good function which has small
range.
Let us consider a class $\mathcal{F}$ of functions from $T$ to $T$:

$\mathcal{F} = \{\, h \mid h(k) = 0$ iff $k = n$, and if
$\exists\, 1 < k < n$ and $a_1, a_2, \ldots, a_k$ in $T - \{0\}$
such that $\sum_{i=1}^{k} a_i = n$, then
$\exists\, 1 \le i, j \le k$ such that $h(a_i) \neq h(a_j) \,\}$
Lemma 11 Let $h \in \mathcal{F}$. Define $f(k) = h(n - k)$.
$f$ is a good function. In other words, the protocol
defined by $f$ satisfies condition 1 [No Deadlock] of
Definition 1.

Lemma 12 Let $f$ be a function from $T$ to $T$ such
that $f(k) = 0$ iff $k = 0$. If the protocol defined by $f$
is correct, then $h \in \mathcal{F}$, where $h(k) = f(n - k)$.

The two lemmas above show that the problem of
finding good BP type protocols is equivalent to the
following interesting mathematical problem:

Color all integers between 1 and $n - 1$ with $m$ colors,
such that given any $1 < k < n$ and for any group
of integers $\{a_1, a_2, \ldots, a_k\}$ colored with the same color,
$\sum_{i=1}^{k} a_i \neq n$. How many colors are enough? How
many colors are necessary?

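Define

$$h(k) = \begin{cases} k & \text{if } 1 \le k < \sqrt{n} \\ \lceil \sqrt{n}\, \rceil + i & \text{if } k > \sqrt{n} \text{ and } \frac{n}{i} \le k < \frac{n}{i-1} \\ 0 & \text{if } k = n \end{cases}$$

Theorem 6 $h \in \mathcal{F}$.

Proof. Suppose that $\sum_{j=1}^{k} a_j = n$, where $a_j \in
[1, n]\ \forall j$, $1 < k < n$, and all $h(a_j)$ have the same value $t$.

Case 1: $t < \sqrt{n}$

$a_1 = a_2 = \cdots = a_k = t$. Thus $t$ is a divisor of $n$. But
$n$ is a prime.

Case 2: $t > \sqrt{n}$

By definition, with $t = \lceil \sqrt{n}\, \rceil + i$, we have
$\frac{n}{i} \le a_j < \frac{n}{i-1}$ for all $j$.
Because $n$ is prime, we get $\frac{n}{i} < a_j < \frac{n}{i-1}$ for all $j$.
Thus

$$\frac{kn}{i} < \sum_{j=1}^{k} a_j < \frac{kn}{i-1}$$

Since $\sum_{j=1}^{k} a_j = n$, we get $i - 1 < k < i$. This
contradicts the fact that $k$ is an integer. ■

This yields a protocol with at most $2n^{1.5}$ states in
each processor. We can do better, at some effort.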
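The coloring condition can be verified by brute force for small primes. The sketch below is built on assumptions: it reads the interval in the definition of $h$ as $n/i \le k < n/(i-1)$, so that $i = \lceil n/k \rceil$, and it takes $\lceil \sqrt{n}\, \rceil$ as $\lfloor \sqrt{n} \rfloor + 1$, which is valid for prime $n$. The function names are illustrative and not from the paper.

```python
from math import isqrt


def h(k, n):
    """Candidate coloring of 1..n for a prime n (interval form assumed)."""
    if k == n:
        return 0
    root = isqrt(n)          # floor(sqrt(n)); sqrt(n) is not an integer for prime n
    if k <= root:
        return k             # small values keep their own color
    i = -(-n // k)           # ceil(n / k): the i with n/i <= k < n/(i-1)
    return root + 1 + i      # ceil(sqrt(n)) + i


def same_color_sum(values, n):
    """True if n is a sum of k of the values (repetition allowed) with 1 < k < n."""
    counts = [set() for _ in range(n + 1)]   # counts[s]: possible numbers of summands
    counts[0].add(0)
    for s in range(1, n + 1):
        for a in values:
            if a <= s:
                counts[s].update(c + 1 for c in counts[s - a])
    return any(1 < c < n for c in counts[n])


def is_valid_coloring(color, n):
    """Check that no color class of 1..n-1 contains a forbidden sum equal to n."""
    classes = {}
    for a in range(1, n):
        classes.setdefault(color(a), []).append(a)
    return not any(same_color_sum(vals, n) for vals in classes.values())


print(is_valid_coloring(lambda a: h(a, 13), 13))   # the two-band coloring
print(is_valid_coloring(lambda a: 1, 13))          # one color: 1 + 12 = 13 breaks it
```

For $n = 13$ the coloring uses six colors on $1, \ldots, 12$, within the roughly $2\sqrt{n}$ colors that the state count relies on.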


Theorem  7  For  sufficiently  large  n,  there  exists  a 
BP  type  protocol  which  uses  states 

in  each  processor. 


Proof.  Let  {pi,P2, •  • -.Pr}  be  all  distinct  primes 
<  ginn  =  *,  and  {pi,pj,  -  •  •  ,?•}  be  all  distinct 


primes  <  where  s  >  r.  By  the  prime  number 

theorem. 


s<0(- 


g(lnn  +  In  Inn  —  In  In  Inn) 


)<0{y 


In  n  In  In  n  ^ 


Set  m  =  n<=i  P*>  since  53p<* 


m  <  c 


=  eH'n"  =  n» 


(nlnlnn 
In  n 


Let  4>{m)  denote  the  number  of  positive  integers  <  m 
and  relatively  prime  to  m.  We  then  have 

=  TT(1--)  <  —  = - - - =0{ — 

m  p  Ini  Inlnn  — ln3  Mnlnn' 


Let  n  =  no  (mod  m),  no  <  m.  Because  n  is  prime, 
gcd(no,m)  =  1.  For  i  <  m,  if  gcd(t,  m)  =  1,  then 
yi  =  no  (mod  m)  has  unique  solution  y  (mod  m). 
Let  us  call  this  unique  solution  yt,  that  is,  iyi  =  no 
(mod  m).  Define 

f  0  if  y  =  n 


My)  = 


t  t  <  s  ,pt  is  the  smallest 

prime  divisor  of  y 

i\\/n  1  +  J  otherwise,  ify  =  i  (mod  m), 
i  <  m  and 

n  ^  n 


I  ;m+»i  —  y  ^  max(l,(i-l)m+yjV 

[  0<;<rj^l  +  l  (3 

The  function  h  is  well  defined  for  all  integers  1  < 
y  <  n.  If  A(y)  is  not  defined  by  (1)  tc  (2)  ,  then 

<  y  <  n  and  gcd(y,  m)  =  1.  Therefore,there 
exists  unique  pair  (t,y)  such  that 

y  =  i  (mod  m)  &  — - -  <  y  <  - ^  ,, - ; 

jm  +  yi  max(l,(y  -  l)m  +  y,, 

because  gcd(»,  m)  =  1 .  Thus,  we  know  the  range  of  h 
is  of  size: 

^  ,  ,,  . ,  Vn  In  In  n 

<  1  +  8  +  ^(m)( - > —  +  2) 

mvln  n 

V  Innlnlnn  mV  Inn 

^^yinnlninn  ^ 

We  now  show  that  h  G  T".  Suppose  that  53>=i 
where  aj  G  [l,n]  Vj,  1  <  ib  <  n,  and  all  h{aj)  have 
same  value  v. 
Case 1: $v < \sqrt{n}$

$p_v$ is a common prime divisor of all $a_i$. This implies
that $p_v$ is a divisor of $n$, but $n$ is a prime.

Case  2:  v  =  i\y/n  1  +  j 

Because  o/  =  t  (mod  m),  'il,  let  a;  =  mbi+i,  then 
n  =  22f=i  Therefore,  we  obtain 

that  ki  =  no  (mod  m),  that  is  k  —  yi  (mod  m). 

On  the  other  hand,  because  all  ai  cannot  be  the 
same,  we  have 


kn 

jm  +  yi 


k 

<  n  =  aj  < 
1=1 


_ in _ 

max(l,(j-  l)m  +  y,) 


max(l, (i  -  l)m  +  yi)  <  k  <  jm  +  yi 
This contradicts that $k = y_i \pmod{m}$. ■
Abstract

This paper considers the implementation of non-blocking
concurrent objects on shared-memory multiprocessors.
Real multiprocessors have properties not
present in theoretical models; these properties can be
exploited to design non-blocking protocols that are
more efficient in practice than those allowed by theoretical
models. These new protocols rely on the operating
system to take action when a thread of control is delayed
during its non-blocking update. We illustrate the
effectiveness of this approach by presenting two protocols
that address factors hindering the performance of
Herlihy's standard non-blocking protocol [Herlihy 90,
Herlihy 91a]. These factors are: resources wasted by
attempted non-blocking operations that fail, and the
cost of data copying. We demonstrate the importance
of these factors experimentally, and show how they can
be reduced using protocols that rely on operating system
support. To reduce the overhead of failing non-blocking
operations, our first protocol maintains information
about the utilization of the shared object; experiments
show that this protocol performs better than
the known alternatives. To reduce the cost of data
copying, we introduce a second, optimistic protocol that
avoids copying, except in the case when a thread of control
is delayed during its attempted update.
1 Introduction

Programmers of shared-memory multiprocessors typically
use critical sections guarded by locks to ensure
consistency of shared objects. Locks are well-understood
and easily supported in hardware, and
much research has gone into the development of locking
protocols with low latency and high throughput [Anderson 90,
Mellor-Crummey & Scott 91]. However, locks
suffer from a number of problems, the most important
of which arise from the interaction between locks and
CPU scheduling. For example, if a thread of control
holding a lock is delayed (say, due to a page fault or
a processor preemption), no other thread may operate
on the object protected by the lock. One thread's delay
may prevent the entire parallel program from making
progress until that thread is able to run again.

Lamport introduced lock-free synchronization [Lamport 77],
a technique that allows parallel threads to ensure
consistency of a shared object while avoiding the
problems of locks. Lock-free protocols are attractive
because they ensure that a thread that is delayed while
updating a shared object does not prevent other threads
from making progress. Herlihy proposed a specific
methodology for implementing non-blocking shared objects
[Herlihy 90], based on a preprocessor that transforms
a sequential implementation of an object to an
equivalent concurrent non-blocking implementation.

Although non-blocking synchronization is a promising
idea, it has not been used much in practice. One
reason is that non-blocking synchronization is based on
atomic primitives that are not available on most parallel
hardware. Bershad [Bershad 91b] recently showed
that with appropriate operating system support, these
primitives can be implemented efficiently in a standard
instruction set. Because this work is not yet widely
known, practical experience with non-blocking synchronization
is limited.

As Bershad's work implies, there are important differences
between real systems and the theoretical models
underlying standard non-blocking protocols. The
most important difference is that in real systems, all
events causing significant delay are visible to the operating
system. This allows the design of non-blocking
protocols that rely on the operating system to take corrective
action whenever a thread experiences a delay.
Operating system support enables a much richer variety
of protocols for non-blocking synchronization; we
present new protocols that offer significant performance
improvements over existing non-blocking protocols.

We  identify  some  performance  problems  with  cur¬ 
rent  non-blocking  protocols,  and  suggest  strategies  to 
address them. Herlihy [Herlihy 91a] and Bershad [Bershad 91b]
have pioneered work in this area, addressing
the  effects  of  contention  for  shared  objects.  We  focus 
on  the  following  two  problems:  performance  degrada¬ 
tion  caused  by  updates  that  fail,  and  the  cost  of  data 
copying.  We  propose  a  protocol  to  address  each  prob¬ 
lem;  both  protocols  rely  on  support  from  the  operating 
system. 

When  several  threads  of  control  are  simultaneously 
trying  to  update  a  shared  object,  only  one  can  suc¬ 
ceed  —  the  others  will  accomplish  nothing.  Updates 
that  fail  use  valuable  computational  resources  and  slow 
the  progress  of  other  threads  doing  useful  work.  Her¬ 
lihy  [Herlihy  91a]  addressed  this  problem  by  propos¬ 
ing  a  policy  that  uses  exponential  backoff  to  allevi¬ 
ate  contention  for  the  shared  object.  We  propose  a 
general  framework  for  addressing  this  problem;  within 
this  framework  many  specific  policies  are  possible.  We 
present  one  such  policy,  and  demonstrate  experimen¬ 
tally  its  performance  advantage  over  the  standard  pro¬ 
tocol  and  over  exponential  backoff. 

Data  copying  is  a  major  component  of  current 
non-blocking  protocols.  We  present  the  result  of  a 
study [Beeck & LaMarca 91] showing that copying can
cause  significant  performance  degradation,  and  we  pro¬ 
pose  an  optimistic  protocol  for  non-blocking  objects 
that  avoids  copying  in  the  common  case  when  a  thread 
is  not  delayed  during  its  update. 

2  Differences  Between  Theoret¬ 
ical  Models  and  Real  Systems 

Existing  bus-based,  shared-memory  multiprocessor  sys¬ 
tems  differ  in  several  ways  from  standard  theoreti¬ 
cal  models  of  asynchronous  shared-memory  computers. 
There  are  two  fundamental  differences  relevant  to  this 
paper:  the  use  of  threads  rather  than  physical  proces¬ 
sors,  and  the  predictable  nature  of  delays. 


Rather  than  programming  processors  directly,  pro¬ 
grammers  express  parallelism  in  terms  of  threads  of  con¬ 
trol.  Roughly  speaking,  each  thread  can  be  viewed  as  a 
virtual  processor;  a  program’s  threads  are  multiplexed 
onto  physical  processors.  As  in  the  theoretical  model, 
the  threads  communicate  via  shared  memory,  and  syn¬ 
chronize  by  using  mutual  exclusion  mechanisms  such 
as  locks  and  condition  variables.  The  operating  system 
kernel  allocates  physical  processors  (dynamically)  be¬ 
tween  the  parallel  programs  that  are  running.  In  turn, 
each  program  schedules  its  own  threads  onto  its  pro¬ 
cessors;  this  scheduling  is  typically  done  by  a  runtime 
system  that  is  linked  with  the  parallel  program. 

A  more  significant  difference  resides  in  the  nature 
of  delays  experienced  by  processors  in  the  theoretical 
model,  and  threads  in  real  programs.  In  the  asyn¬ 
chronous  model,  processors  are  assumed  to  experience 
arbitrary  delays  at  arbitrary  times.  There  is  no  way  to 
tell  how  long  a  delay  will  last,  or  whether  it  will  ever 
end.  In  fact,  the  failure  of  a  processor  is  often  modeled 
as  an  infinite  delay. 

In  contrast,  on  real  systems,  all  delays  experienced 
by  a  thread  can  be  divided  into  three  classes: 

1.  Short  delays:  These  delays  are  common,  but  have 
a  duration  of  at  most  a  few  tens  of  clock  cycles. 
Short  delays  are  caused  by  events  such  as  cache  and 
translation-buffer  misses,  bus  and  memory  con¬ 
tention,  and  timer  interrupts.  Programmers  typi¬ 
cally  do  not  think  about  individual  delays  of  this 
type,  but  rather  model  them  as  a  uniform  degra¬ 
dation  in  execution  speed. 

2. Long delays: In addition, threads suffer from delays
of long duration, e.g. 100,000 clock cycles or
more.  These  delays  are  caused  by  page  faults,  I/O 
operations,  processor  preemptions,  and  reschedul¬ 
ing  of  threads.  As  Bershad  observed  [Bershad  91b], 
all  long  delays  are  caused  by  operating  system 
events,  so  the  operating  system  is  always  aware 
when  a  long  delay  begins  or  ends. 

3.  Infinite  delays,  or  failures:  Real  shared-memory 
multiprocessors  are  not  robust  against  failures  of 
hardware  or  critical  software  components.  Such 
failures  are  considered  catastrophic;  those  applica¬ 
tions  that  need  to  recover  from  failures  use  external 
mechanisms  such  as  checkpointing  or  transactional 
logging.  In  this  paper,  we  will  not  concern  our¬ 
selves  with  failures  —  instead,  we  will  assume  that 
fault-tolerance is handled at another level of the
system. 

To  summarize,  in  real  multiprocessor  systems  both 
short and infinite delays can be ignored. Only long delays
have a significant effect, and they are always known
to  the  operating  system. 

3  Blocking  vs.  Non-blocking 
Synchronization  in  Real  Sys¬ 
tems 

In  the  theoretical  model,  a  non-blocking  shared  object 
guarantees  that  some  processor  will  succeed  in  access¬ 
ing  the  object  within  a  bounded  length  of  time.  In  real 
systems,  long  delays  are  bounded  in  length,  and  hap¬ 
pen  at  most  once  per  instruction;  therefore,  even  the 
simplest  locking  protocols  are  technically  non-blocking. 
This  is  not  much  consolation  to  the  programmer:  if  a 
thread  holding  a  lock  experiences  a  long  delay,  no  other 
thread  may  access  the  object  protected  by  the  lock.  Al¬ 
though  the  delay  is  finite,  the  performance  penalty  is 
unacceptable  in  practice. 

Ideally,  we  would  like  a  guarantee  that  progress  will 
be  made  in  time  significantly  less  than  a  long  delay. 
Mutual  exclusion  mechanisms  do  not  have  this  prop¬ 
erty,  since  the  thread  with  exclusive  access  may  suffer  a 
long  delay.  Non-blocking  protocols  are  less  sensitive  to 
delays  than  are  mutual  exclusion  protocols,  because  one 
thread’s  delay  does  not  prevent  other  threads  from  mak¬ 
ing  progress.  This  motivates  the  use  of  non-blocking 
protocols  on  real  systems. 

A key feature of real systems is that the beginning
and  end  of  a  long  delay  are  always  known  to  the  oper¬ 
ating  system.  This  fact  allows  us  to  propose  protocols 
that  rely  on  the  operating  system  to  take  some  action 
when  a  long  delay  begins  or  ends.  As  a  result,  a  much 
wider  range  of  non-blocking  protocols  can  be  designed. 
For  example,  the  non-blocking  protocols  we  propose  in 
sections  6  and  7  depend  on  such  operating  system  sup¬ 
port. 

Bershad  was  the  first  to  use  operating-system  mecha¬ 
nisms  to  support  non-blocking  synchronization;  he  im¬ 
plemented  a  non-blocking  Compare-and-Swap  opera¬ 
tion  on  a  multiprocessor  that  did  not  have  a  hardware 
Compare-and-Swap primitive [Bershad 91b]. Bershad’s
work  allowed  existing  non-blocking  protocols  to  be  im¬ 
plemented  on  a  wider  range  of  machines. 

The  main  contribution  of  our  paper  is  to  show  how 
operating  system  support  can  be  exploited  to  design 
a  wider  range  of  non-blocking  protocols  than  those  al¬ 
lowed  by  the  theoretical  model.  Our  protocols  rely  on 
the  fact  that  the  operating  system  can  perform  a  va¬ 
riety  of  actions  in  response  to  delays.  By  exploiting 
this  flexibility,  we  develop  protocols  which  offer  better 
performance  in  practice. 


4  Existing  Non-Blocking  Proto¬ 
cols 

Lamport introduced lock-free synchronization [Lamport 77],
a technique that allows parallel threads to
ensure consistency of a shared object without requiring
mutual exclusion. Lock-free shared objects can
be  divided  into  non-blocking  objects  and  wait-free  ob¬ 
jects  [Herlihy  91b].  Non-blocking  objects  guarantee 
that  some  thread  accessing  the  object  will  complete  its 
operation  in  a  fixed  number  of  steps.  Wait-free  objects 
guarantee  that  all  threads  will  complete  their  accesses 
to  the  object  within  a  fixed  number  of  steps.  In  this 
paper  we  consider  only  non-blocking  objects. 

Herlihy  [Herlihy  88,  Herlihy  91b]  has  shown  that  it  is 
impossible  to  construct  non-blocking  implementations 
of  arbitrary  concurrent  objects  with  any  combination  of 
atomic  read,  write,  fetch-and-op  and  memory  to  mem¬ 
ory  swap.  There  are,  however,  universal  atomic  op¬ 
erations  which  are  capable  of  implementing  arbitrary 
non-blocking  objects  [Herlihy  91b].  The  best-known  of 
these universal primitives are Compare-and-Swap, and
the combination of Load-Linked and Store-Conditional.

4.1  Herlihy’s  Methodology  for  Non- 
blocking  Objects 

Herlihy  [Herlihy  90,  Herlihy  91a]  introduced  a  tech¬ 
nique  by  which  a  preprocessor  can  transform  an  im¬ 
plementation  of  an  arbitrary  object  into  an  equivalent 
non-blocking  concurrent  implementation.  Threads  op¬ 
erating  on  the  non-blocking  object  follow  the  protocol 
illustrated  by  the  pseudocode  in  Figure  1.  In  order  to 
perform  a  non-blocking  update  of  the  shared  object,  a 
thread  first  acquires  a  pointer  to  the  current  version 
of  the  shared  object  and  uses  this  pointer  to  make  a 
private  copy  of  this  version.  Then,  the  thread  updates 
this  private  copy  and  attempts  to  install  it  as  the  new 
version  of  the  shared  object.  The  thread’s  non-blocking 
operation  succeeds  if  no  new  version  of  the  shared  ob¬ 
ject  has  been  installed  since  the  thread  began  its  oper¬ 
ation.  Otherwise,  the  thread’s  operation  fails,  and  the 
thread  tries  its  update  again. 

This protocol is based on atomic primitives we will
call take_snapshot and check_and_install. Their
specifications are as follows: take_snapshot returns
a pointer to the current version of the shared object.
check_and_install installs a new version of
the shared object if and only if no new version has
been installed since the caller’s last take_snapshot.
check_and_install returns Success if the new version
was installed, and Failure otherwise. Different instructions
can be used to implement take_snapshot and

private var
    private_version : POINTER TO Object;

procedure NonblockingUpdate(var shared_object : SharedObject) {
    var
        snapshot : POINTER TO Object;
    repeat {
        snapshot := take_snapshot(shared_object);
        copy object from *snapshot to *private_version;
        compute_new_value(private_version);
    } until (check_and_install(shared_object, private_version) = Success);
}

Figure 1: Pseudocode for Herlihy’s standard non-blocking protocol. Private variables are kept on a per-thread basis.
The operator * dereferences a pointer. Certain details of memory management are omitted.


check_and_install. Herlihy originally proposed using
Load for take_snapshot and Compare-and-Swap¹ for
check_and_install [Herlihy 90]. Later, Herlihy advocated
using Load-Linked for take_snapshot and
Store-Conditional for check_and_install [Herlihy 91a].
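
The retry loop described in this section can be illustrated with a minimal Python sketch. It is not Herlihy's implementation: the class name SharedObject, the integer ticket returned by take_snapshot, and the short lock that stands in for an atomic Load-Linked/Store-Conditional pair are all assumptions made for illustration. The lock is held only inside the two primitives, never while a thread copies or computes.

```python
import copy
import threading


class SharedObject:
    """Pointer to the current version plus a count of successful installs.

    The lock only makes the two small primitives atomic (an assumption
    standing in for hardware support); it is never held during an update.
    """

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._current = initial
        self._installs = 0

    def take_snapshot(self):
        """Return (current version, ticket identifying that version)."""
        with self._lock:
            return self._current, self._installs

    def check_and_install(self, ticket, new_version):
        """Install new_version only if nothing was installed since ticket."""
        with self._lock:
            if ticket != self._installs:
                return False
            self._current = new_version
            self._installs += 1
            return True


def nonblocking_update(shared, compute_new_value):
    """Retry copy -> update -> install until the install succeeds."""
    attempts = 0
    while True:
        attempts += 1
        snapshot, ticket = shared.take_snapshot()
        private_version = copy.deepcopy(snapshot)  # private copy
        compute_new_value(private_version)         # mutate the copy only
        if shared.check_and_install(ticket, private_version):
            return attempts


def add_one(version):
    version["count"] += 1


def worker(shared, updates):
    for _ in range(updates):
        nonblocking_update(shared, add_one)


shared = SharedObject({"count": 0})
threads = [threading.Thread(target=worker, args=(shared, 200)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_version, _ = shared.take_snapshot()
```

Four threads each perform 200 increments. Failed installs are simply retried, so every increment is eventually applied exactly once to the installed version.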


4.2  Implementation of take_snapshot
and check_and_install

For concreteness, Figure 2 shows an implementation
of take_snapshot and check_and_install based on
timestamps. Each attempted update of the shared
object is tagged with a timestamp; timestamps are
unique and increasing in time. A shared variable
kill_timestamp is maintained. When an update succeeds,
it increases kill_timestamp to equal the highest
timestamp given out so far; updates with timestamps
less than or equal to kill_timestamp are operating on a
stale version of the shared object and will fail.
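
The timestamp rule can be traced with a small Python sketch that mirrors Figure 2. The class names TimestampedObject and Updater are hypothetical, and a lock stands in for the figure's "atomically" blocks; the interleaving of the two updaters is written out by hand so that the outcome is deterministic.

```python
import threading


class TimestampedObject:
    def __init__(self, initial):
        self.atomic = threading.Lock()  # stands in for "atomically { ... }"
        self.current_version = initial
        self.current_timestamp = 0
        self.kill_timestamp = 0


class Updater:
    """One per thread; holds the private variable my_timestamp."""

    def __init__(self):
        self.my_timestamp = 0

    def take_snapshot(self, obj):
        with obj.atomic:
            snapshot = obj.current_version
            obj.current_timestamp += 1
            self.my_timestamp = obj.current_timestamp
        return snapshot

    def check_and_install(self, obj, new_version):
        with obj.atomic:
            if self.my_timestamp > obj.kill_timestamp:
                obj.current_version = new_version
                obj.kill_timestamp = obj.current_timestamp
                return True
            return False


obj = TimestampedObject(("v0",))
a, b = Updater(), Updater()

snap_a = a.take_snapshot(obj)  # a receives timestamp 1
snap_b = b.take_snapshot(obj)  # b receives timestamp 2
b_ok = b.check_and_install(obj, snap_b + ("b",))  # installs; kill_timestamp = 2
a_ok = a.check_and_install(obj, snap_a + ("a",))  # 1 > 2 is false: stale

snap_a = a.take_snapshot(obj)  # retry: a receives timestamp 3
a_retry_ok = a.check_and_install(obj, snap_a + ("a",))
```

Updater a fails its first attempt because b installed a version after a took its snapshot; the retry works from the version that b installed and succeeds.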


¹Compare-and-Swap does not exactly satisfy the specification
for check_and_install. Compare-and-Swap checks whether the
location has the same value, whereas check_and_install is
required to check whether the location has been modified. It may
be the case that the location has changed value from A to B
(say), and then back to A. In this case, Compare-and-Swap
will succeed when check_and_install would have failed. In
Herlihy’s protocol, take_snapshot and check_and_install are done
on locations containing pointers to memory buffers;
Compare-and-Swap may incorrectly succeed if memory buffers are
recycled. Herlihy dealt with this problem by modifying the
underlying buffer management protocol. The added complexity of this
buffer management scheme is one reason Herlihy later favored
Load-Linked/Store-Conditional over Compare-and-Swap.
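
The A-to-B-to-A case in this footnote can be reproduced in a few lines of Python. This is an illustrative sketch only: the class Cell and its write counter are hypothetical, the code is single-threaded, and each method is assumed to execute atomically.

```python
class Cell:
    """A memory location offering two ways to validate an update."""

    def __init__(self, value):
        self.value = value
        self.writes = 0  # modification count, used only by the stricter check

    def store(self, value):
        self.value = value
        self.writes += 1

    def compare_and_swap(self, expected, new):
        """Succeeds whenever the location holds the same value."""
        if self.value == expected:
            self.store(new)
            return True
        return False

    def check_and_install(self, writes_seen, new):
        """Succeeds only if the location has not been modified at all."""
        if self.writes == writes_seen:
            self.store(new)
            return True
        return False


def run(use_cas):
    cell = Cell("A")
    seen_value, seen_writes = cell.value, cell.writes  # take the snapshot
    cell.store("B")  # another thread installs buffer B ...
    cell.store("A")  # ... then buffer A is recycled and installed again
    if use_cas:
        return cell.compare_and_swap(seen_value, "C")
    return cell.check_and_install(seen_writes, "C")


cas_result = run(use_cas=True)
strict_result = run(use_cas=False)
```

The value-comparing primitive accepts the stale update, while the modification-based check rejects it.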


4.3  Non-blocking  Synchronization 
Without  Hardware  Support 

Originally,  the  lack  of  hardware  support  for  Compare- 
and-Swap  or  Load-Linked  and  Store-Conditional  was  a 
fundamental obstacle to using Herlihy’s methodology.
However,  Bershad  [Bershad  91b]  observed  that  the  lack 
of  atomic  instructions  in  hardware  could  be  remedied 
given  appropriate  operating  system  support.  He  sug¬ 
gested  software  implementations  using  critical  sections 
(guarded  by  locks)  to  atomically  simulate  Compare- 
and-Swap  with  “regular”  instructions.  When  a  thread 
inside  one  of  these  critical  sections  is  delayed,  the  op¬ 
erating  system  notifies  the  runtime  system  which  in 
turn  backs  the  delayed  thread  out  of  the  critical  sec¬ 
tion  or  rolls  it  forward  past  the  critical  section  [Ber¬ 
shad  91a].  Because  of  the  special  structure  of  the  criti¬ 
cal  section  implementing  Compare-and-Swap,  this  roll¬ 
back  or  roll-forward  is  always  possible.  Bershad  showed 
that  the  performance  of  his  software  implementation  of 
Compare-and-Swap  was  acceptable:  due  to  the  small 
size  of  the  critical  section  implementing  Compare-and- 
Swap,  threads  were  almost  never  delayed  within  it. 
Bershad’s technique extends to other primitives including
Load-Linked and Store-Conditional, and the
timestamp-based implementations of take_snapshot
and check_and_install shown in Figure 2.

5  Problems  with  Current  Non- 
blocking  Protocols 

Although operating system support allows us to compensate
for the lack of hardware support, a more fundamental
problem with Herlihy’s methodology is the overhead
that results from having multiple threads compute
a new version of the shared object. This overhead can
be dissected into two components: useless parallelism
and unnecessary copying.

type SharedObject is
    current_version : POINTER TO Object;
    current_timestamp : Integer INITIALLY 0;
    kill_timestamp : Integer INITIALLY 0;
end

private var
    my_timestamp : Integer;

procedure take_snapshot(var obj : SharedObject) {
    var
        snapshot : POINTER TO Object;
    atomically {
        snapshot := obj.current_version;
        increment obj.current_timestamp;
        my_timestamp := obj.current_timestamp;
    }
    return snapshot;
}

procedure check_and_install(var obj : SharedObject, new_version : POINTER TO Object) {
    atomically {
        if (my_timestamp > obj.kill_timestamp) {
            obj.current_version := new_version;
            obj.kill_timestamp := obj.current_timestamp;
            return Success;
        } else {
            return Failure;
        }
    }
}

Figure 2: Pseudocode for a timestamp-based implementation of take_snapshot and check_and_install.
Private variables are kept on a per-thread basis.

Useless  Parallelism  Consumes  Resources. 

Under  the  standard  protocol,  several  threads  may 
simultaneously  attempt  to  update  the  shared  ob¬ 
ject.  Of  all  these  threads  only  one  will  succeed  in 
its  update;  all  the  other  threads  will  fail  and  try 
again.  Threads  that  fail  are  using  computational 
resources  that  might  be  put  to  better  use,  for  in¬ 
stance  by  running  another  thread  on  that  proces¬ 
sor. 

Even  worse,  the  failing  threads  use  resources  such 
as  the  memory  bus  when  attempting  their  updates, 
degrading  the  performance  of  the  thread  that  is 
successful. In short, the losers not only lose but
they slow down the winner.


Unnecessary  Copying  Slows  Down  Updates.  In 
Herlihy’s  methodology  each  thread  makes  a  copy 
of the shared object and then performs a computation
on this copy. This ensures that when any other
thread  examines  the  shared  object,  it  sees  it  in  a 
consistent  state.  The  drawback  is  that  this  forces 
threads  to  copy  the  whole  object  even  if  they  will 
modify  only  a  small  part.  In  cases  where  only  small 
sections  of  the  object  are  modified  this  will  result 
in  unnecessary  copying  which  will  slow  down  com¬ 
putation  even  further,  especially  if  memory  band¬ 
width is scarce.

6  Reducing Useless Parallelism

We  can  reduce  the  performance  cost  of  failed  updates  by 
controlling  the  number  of  threads  that  simultaneously 
attempt  updates  to  a  single  object.  Excess  threads  can 
“get  out  of  the  way”  by  yielding  their  processor  to  an¬ 
other  thread,  or  simply  by  waiting  for  the  demand  for 
the  shared  object  to  abate. 

There  are  many  possible  policies  to  control  the  num¬ 
ber  of  simultaneous  update  attempts.  Of  course, 
any  policy  must  satisfy  the  non-blocking  property  — 
no  thread  can  be  prevented  from  beginning  its  up¬ 
date,  unless  some  other  thread  is  currently  making 
progress  on  its  update.  The  implementer  is  free  to 
choose  policies  on  a  case-by-case  basis,  taking  advan¬ 
tage  of  application-specific  knowledge  if  available.  Her- 
lihy  [Herlihy  91a]  proposes  one  policy  of  this  type,  based 
on  exponential  backoff. 
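
For comparison with the policies that follow, a backoff policy can be sketched as below. This is a generic randomized exponential backoff loop, not Herlihy's code: the source does not give his parameters, so the function name update_with_backoff, the base and cap values, and the uniform randomization are all assumptions.

```python
import random


def update_with_backoff(try_update, wait, base=1.0, cap=64.0, rng=random.random):
    """Retry try_update(), waiting a random time whose bound doubles per failure.

    Returns the number of failed attempts before the successful one.
    """
    failures = 0
    while not try_update():
        limit = min(cap, base * 2 ** failures)  # bound doubles, up to cap
        wait(rng() * limit)
        failures += 1
    return failures


# Deterministic demonstration: the update fails three times, then succeeds.
outcomes = iter([False, False, False, True])
waits = []
failures = update_with_backoff(
    try_update=lambda: next(outcomes),
    wait=waits.append,  # record the delay instead of sleeping
    rng=lambda: 1.0,    # always wait the full bound
)
```

In real use, wait would be something like time.sleep and try_update would be one snapshot, copy, compute and install attempt.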


6.1  A  Family  of  Protocols  that  Reduce 
Useless  Parallelism 

In  this  subsection  we  present  a  family  of  policies  for 
reducing  useless  parallelism.  These  policies  maintain 
a  count  of  the  number  of  updates  to  the  shared  ob¬ 
ject  currently  in  progress.  Threads  defer  their  updates 
if  this  count  is  greater  than  some  (usually  small)  con¬ 
stant  K.  Threads  increment  the  count  when  they  begin 
their  update,  and  decrement  the  count  when  they  fin¬ 
ish  their  update.  When  a  thread  is  delayed  during  its 
update,  the  runtime  system  decrements  the  counter  so 
another  thread  may  begin  its  update;  this  guarantees 
the  nonblocking  property.  When  the  delayed  thread  is 
awakened,  the  runtime  system  increments  the  counter. 
The  full  pseudocode  for  our  algorithm  is  shown  in  fig¬ 
ure  3. 

At  first  glance,  this  approach  appears  no  different 
from  conventional  mutual  exclusion.  In  fact,  in  the  ab¬ 
sence  of  delays,  the  algorithm  behaves  the  same  as  mu¬ 
tual  exclusion.  However,  if  a  thread  is  delayed  while  it 
is  operating  on  its  copy  of  the  shared  object,  the  oper¬ 
ating  system  notices  this  delay,  and  notifies  the  runtime 
system,  which  in  turn  allows  another  thread  to  start  an 
operation. A thread that is delayed does not prevent
other threads from making progress; the delayed thread
is  working  only  on  its  private  version  of  the  object,  and 
other  threads  still  see  a  consistent  public  version.  Our 
algorithm  is  non-blocking,  while  standard  mutual  ex¬ 
clusion  protocols  are  not. 
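
The difference from mutual exclusion can be traced with a short Python sketch modeled on Figure 3, with K = 1. The class LimitedObject and its method names are hypothetical, the calls that the runtime system would make are invoked by hand, and the two threads are simulated by a fixed interleaving so the outcome is deterministic.

```python
import copy
import threading


class LimitedObject:
    def __init__(self, initial, k=1):
        self.k = k
        self.atomic = threading.Lock()  # stands in for "atomically"
        self.active_threads = 0
        self.current_version = initial
        self.current_timestamp = 0
        self.kill_timestamp = 0

    def may_begin(self):
        """A thread defers its update while this returns False."""
        return self.active_threads < self.k

    def take_snapshot(self):
        with self.atomic:
            self.active_threads += 1
            self.current_timestamp += 1
            return self.current_version, self.current_timestamp

    def check_and_install(self, my_timestamp, new_version):
        with self.atomic:
            self.active_threads -= 1
            if my_timestamp > self.kill_timestamp:
                self.current_version = new_version
                self.kill_timestamp = self.current_timestamp
                return True
            return False

    # The next two methods are called by the runtime system,
    # not by the updating thread itself.
    def thread_delay_begin_notify(self):
        with self.atomic:
            self.active_threads -= 1

    def thread_delay_end_notify(self):
        with self.atomic:
            self.active_threads += 1


obj = LimitedObject({"n": 0}, k=1)

snap1, ts1 = obj.take_snapshot()  # thread 1 begins its update
private1 = copy.deepcopy(snap1)
private1["n"] += 1
blocked = not obj.may_begin()     # thread 2 must defer

obj.thread_delay_begin_notify()   # runtime: thread 1 has been delayed
allowed = obj.may_begin()         # thread 2 may now start

snap2, ts2 = obj.take_snapshot()
private2 = copy.deepcopy(snap2)
private2["n"] += 10
t2_ok = obj.check_and_install(ts2, private2)

obj.thread_delay_end_notify()     # runtime: thread 1 resumes
t1_ok = obj.check_and_install(ts1, private1)  # stale: thread 1 must retry
```

As in Figure 3, the test in may_begin and the increment in take_snapshot are separate steps, so the count only limits how many threads attempt updates; the timestamps still decide which update is installed.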


6.2  Performance  Implications 

To  illustrate  the  performance  improvement  possible 
by  reducing  useless  parallelism,  we  conducted  an  ex¬ 
periment  on  three  non-blocking  implementations  of  a 
32-byte  object.  One  implementation,  called  PLAIN, 
used  Load-Linked  and  Store-Conditional,  implemented 
in  software  using  spinlocks.  The  second  implementa¬ 
tion,  called  BACKOFF,  used  exponential  backoff  as 
suggested  by  Herlihy  [Herlihy  91a].  The  final  imple¬ 
mentation,  called  SOLO,  used  the  policy  of  section  6.1 
with  K  =  1.  We  ran  one  thread  on  each  processor;  each 
thread  was  programmed  to  alternate  between  short  pe¬ 
riods  of  private  computation  and  operations  on  the 
shared  object.  (The  length  of  the  private  computation 
periods  was  chosen  to  be  about  half  the  time  required 
for  an  attempted  update  of  the  shared  object.)  For  each 
implementation,  we  measured  the  total  number  of  suc¬ 
cessful  non-blocking  operations  per  second  as  a  function 
of  the  number  of  processors.  The  measurements  were 
taken  on  a  Sequent  Symmetry  with  20  processors,  and 
the  results  are  shown  in  Figure  4. 

The  figure  shows  that  the  throughput  of  our  protocol 
is  higher  than  that  of  the  plain  protocol  and  the  back¬ 
off  protocol.  All  three  protocols  do  the  same  amount  of 
copying;  the  performance  differences  can  be  attributed 
to  bus  contention,  which  can  be  traced  to  two  causes: 
contention  for  synchronization  variables  such  as  locks 
and  the  count  of  active  threads,  and  failing  updates’ 
use  of  the  bus  to  make  their  private  copy  of  the  shared 
object.  PLAIN  suffers  from  contention  of  both  types. 
SOLO  does  not  have  any  failing  updates;  its  perfor¬ 
mance  degrades  as  processors  are  added  because  of  con¬ 
tention  for  the  count  of  active  threads.  BACKOFF  suf¬ 
fers  from  synchronization  contention,  from  occasional 
failing  updates,  and  from  periods  in  which  no  thread 
is  attempting  an  update  but  all  waiting  threads  are 
“backed  off”. 

The  experiment  fails  to  measure  a  real-life  effect  that 
favors  our  SOLO  protocol  over  PLAIN  and  BACKOFF. 
In  a  practical  situation,  threads  that  defer  their  up¬ 
dates  in  the  SOLO  protocol  could  yield  their  processors 
to  other  threads,  which  could  accomplish  useful  work. 
In  the  experiment,  these  threads  simply  waited  for  the 
count  to  decrease  below  K. 

7  Reducing  Unnecessary  Copy¬ 
ing 

Recall  that  in  current  non-blocking  protocols,  every 
time  a  thread  attempts  to  update  a  non-blocking  ob¬ 
ject,  it  must  copy  the  entire  object.  If  the  object  is 

const
    K : Integer;

type SharedObject is
    active_threads : Integer INITIALLY 0;
    current_version : POINTER TO Object;
    current_timestamp : Integer INITIALLY 0;
    kill_timestamp : Integer INITIALLY 0;
end

private var
    private_version : POINTER TO Object;
    snapshot : POINTER TO Object;
    my_timestamp : Integer;

procedure NonblockingUpdate(var obj : SharedObject) {
    repeat {
        while (obj.active_threads >= K) { defer(); }
        snapshot := take_snapshot(obj);
        copy object from *snapshot to *private_version;
        compute_new_value(private_version);
    } until (check_and_install(obj, private_version) = Success);
}

procedure take_snapshot(var obj : SharedObject) {
    atomically {
        increment obj.active_threads;
        snapshot := obj.current_version;
        increment obj.current_timestamp;
        my_timestamp := obj.current_timestamp;
    }
    return snapshot;
}

procedure check_and_install(var obj : SharedObject, new_version : POINTER TO Object) {
    atomically {
        decrement obj.active_threads;
        if (my_timestamp > obj.kill_timestamp) {
            obj.current_version := new_version;
            obj.kill_timestamp := obj.current_timestamp;
            return Success;
        } else {
            return Failure;
        }
    }
}

procedure ThreadDelayBegin_Notify(var obj : POINTER TO SharedObject) {
    atomically decrement obj.active_threads;
}

procedure ThreadDelayEnd_Notify(var obj : POINTER TO SharedObject) {
    atomically increment obj.active_threads;
}

Figure 3: Pseudocode for a non-blocking protocol that reduces useless parallelism. Private variables are kept on a
per-thread basis. The operator * dereferences a pointer. The protocol maintains a variable, active_threads, that
is equal to the number of threads actively executing updates. Threads entering the procedure NonblockingUpdate
wait until active_threads is less than some constant K before trying their updates; during this waiting period the
threads may spin, or they may yield their processor to another thread. The runtime system calls the procedure
ThreadDelayBegin_Notify whenever a thread is delayed during its update; the procedure ThreadDelayEnd_Notify
is called when the delay ends.

Figure 4: Throughput of three implementations of a 32-byte non-blocking shared object. Throughput is the number
of successful non-blocking operations per second.


large,  and  the  thread  needs  to  modify  a  small  part  of 
it,  copying  the  whole  object  is  wasteful.  This  was  il¬ 
lustrated  by  a  study  of  non-blocking  implementations 
of  a  name-server  on  our  Sequent  Symmetry  [Beeck  & 
LaMarca 91]. The implementations used Herlihy’s protocol
using Load for take_snapshot and Compare-and-Swap
for check_and_install. The Compare-and-Swap
primitive  was  implemented  using  a  short  critical  section 
guarded  by  a  lock. 


This  study  found  that  the  amount  of  copying  involved 
in  the  non-blocking  operation  had  a  direct  impact  on 
performance, as shown in Figure 5. The authors originally
followed Herlihy’s approach and implemented the
non-blocking protocol based on a sequential implementation
of the name-server. They quickly realized that
performance  could  be  improved  by  reducing  the  amount 
of  copying.  Using  optimizations  specific  to  the  seman¬ 
tics  of  the  name-server,  they  were  able  to  substan¬ 
tially  reduce  the  amount  of  copying.  Figure  5  shows 
the  elapsed  time  required  by  different  implementations 
of  the  name-server  to  perform  10080  operations.  The 
three implementations differ in the amount of copying
each  thread  must  do  to  build  its  private  copy  of  the 
shared  object.  The  most  optimized  implementation  re¬ 
quired  copying  four  bytes  of  memory. 


7.1  An  Optimistic  Protocol  to  Reduce 
Copying 

As  shown  in  Figure  5,  making  a  full  copy  of  the  shared 
object  for  each  update  carries  a  significant  performance 
price.  We  would  like  to  reduce  the  amount  of  copying 
that  is  necessary.  We  can  do  this  by  enlisting  further 
support  from  the  runtime  system. 

We  present  an  optimistic  protocol,  which  is  designed 
to  reduce  copying  in  the  common  case  when  a  thread 
is  not  delayed  while  updating  the  shared  object.  In 
the uncommon case when a thread is delayed during
its  update,  the  runtime  system  steps  in  and  restores  a 
consistent  state. 

The  optimistic  protocol  is  based  on  our  protocol  of 
section 6.1 with K = 1 (the SOLO protocol). This
protocol  allows  only  one  thread  to  attempt  an  update 
at  a  time;  all  other  threads  either  spin  waiting  for 
this  thread  to  finish,  or  yield  their  processors  to  other 
threads.  Only  if  a  thread  is  delayed  during  its  update  is 
a  second  thread  allowed  to  begin  another  update.  When 
only  one  thread  is  carrying  out  an  update,  it  can  take 
advantage  of  this  fact  by  working  directly  on  the  public 
version,  rather  than  copying  it.  Of  course,  the  thread 
must  keep  a  log  of  its  changes,  so  the  runtime  system 
can  restore  a  consistent  state  if  necessary.  Additional 
threads  will  not  notice  that  the  object  is  in  an  inconsis¬ 
tent  state,  because  their  updates  will  not  be  allowed  to 
start until the object is once again in a consistent state.

Figure 5: Elapsed time for 10080 operations in a non-blocking name-server. The three implementations differ only
in the amount of data copied in each attempted non-blocking update.


When  starting  its  update,  a  thread  “borrows”  the 
current  public  version  of  the  shared  object.  It  performs 
its  changes  directly  on  the  borrowed  version,  maintain¬ 
ing  a  log.  When  the  thread  completes  its  update,  it 
re-installs  the  borrowed  version  as  the  new  “official” 
version,  and  discards  its  log. 

If  a  thread  is  delayed  during  its  update,  the  runtime 
system  uses  the  thread’s  “dirty”  borrowed  version  and 
the  thread’s  log  to  build  a  consistent  version  of  the 
shared  object.  This  consistent  version  is  identical  to 
the  version  that  the  thread  originally  borrowed.  The 
runtime  system  installs  this  reconstructed  version  as 
the  current  public  version,  and  allows  another  thread 
to  begin  its  update. 

This  second  thread  follows  the  same  procedure;  it 
borrows  the  current  public  version,  and  updates  this 
borrowed  version  directly,  while  logging  its  changes.  If 
this  second  thread  is  delayed,  the  runtime  system  will 
once  again  use  the  delayed  thread’s  borrowed  version 
and  log  to  rebuild  and  install  a  consistent  version  of 
the  shared  object.  A  third  thread  is  now  allowed  to 
begin  its  update.  In  this  manner,  an  arbitrary  number 
of  threads  can  be  delayed  during  their  updates  with¬ 
out  preventing  further  threads  from  making  progress. 
If  several  threads  are  working  on  updates  simultane¬ 
ously,  whichever  thread  finishes  first  will  succeed,  and 
the  other  threads  will  fail. 
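
The borrow-and-log idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the shared object is modeled as a dictionary, the names Borrower and rebuild_consistent are hypothetical, the log records old values for undo, and the timestamp check that makes a delayed thread's later install fail is left out.

```python
_MISSING = object()  # marks "key was absent before the write"


class Borrower:
    """A thread's in-place update of a borrowed version, with an undo log."""

    def __init__(self, version):
        self.version = version  # the borrowed public buffer, edited in place
        self.log = []           # (key, previous value) pairs, oldest first

    def write(self, key, value):
        self.log.append((key, self.version.get(key, _MISSING)))
        self.version[key] = value


def rebuild_consistent(borrower):
    """Runtime-side recovery: undo the log on a copy of the dirty buffer."""
    rebuilt = dict(borrower.version)
    for key, old in reversed(borrower.log):
        if old is _MISSING:
            rebuilt.pop(key, None)
        else:
            rebuilt[key] = old
    return rebuilt


# Common case: no delay, so the whole object is never copied.
public = {"alice": 1, "bob": 2}
t1 = Borrower(public)
t1.write("alice", 10)
t1.log.clear()                   # update complete: discard the log
after_fast_path = dict(public)

# Uncommon case: the thread is delayed halfway through its update.
t2 = Borrower(public)
t2.write("bob", 20)
t2.write("carol", 3)
dirty = dict(t2.version)
public = rebuild_consistent(t2)  # runtime installs the reconstructed version
```

The reconstructed version is a separate buffer, so the delayed thread can keep writing to its borrowed buffer after it resumes without disturbing the new public version.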


7.2  Performance Tradeoffs

The  optimistic  protocol  is  best  suited  for  those  cases 
in  which  updates  modify  only  a  small  portion  of  the 
shared  object,  and  threads  are  rarely  delayed  during 
their  updates.  In  these  cases,  the  amortized  cost  of 
keeping  the  log  and  reconstructing  the  shared  object  is 
small  relative  to  the  cost  of  copying  the  whole  object 
on  every  update. 

On  the  other  hand,  if  these  assumptions  are  not  true, 
a  protocol  such  as  SOLO  will  outperform  the  optimistic 
protocol.  We  note  that  it  is  possible  to  switch  back  and 
forth  “on  the  fly”  between  the  optimistic  protocol  and 
the  SOLO  protocol.  The  only  difference  is  that  threads 
borrow  the  public  version  in  the  optimistic  protocol, 
and  they  copy  the  public  version  in  the  SOLO  protocol. 
By  switching  between  borrowing  and  copying,  we  can 
effectively  switch  between  protocols.  This  allows  the 
system  to  choose  its  protocol  adaptively. 

8  Future  Work 

Realistic  implementation  of  the  ideas  presented  in  this 
paper  requires  support  for  an  operating  system  mecha¬ 
nism  such  as  scheduler  activations  [Anderson  et  al.  92]. 
We  expect  to  have  a  version  of  Mach  with  scheduler 
activations  running  on  our  Sequent  within  the  next  few 
months. This will allow us to study the protocols introduced
in this paper in a more realistic setting. In
particular,  we  would  like  to  implement  the  optimistic 
protocol  of  section  7.1  and  study  its  performance  in  real 
applications. 

We would also like to see non-blocking synchronization
made widely available. This requires an efficient
implementation  and  an  understanding  of  the  required 
programming  models  and  language  support.  For  exam¬ 
ple,  we  envision  a  threads  package  that  provides  perva¬ 
sive  support  for  non-blocking  operations.  Widespread 
experience  is  necessary  to  seriously  evaluate  the  practi¬ 
cality  and  convenience  of  non-blocking  synchronization. 


9  Conclusion 

Non-blocking  implementations  of  concurrent  objects  of¬ 
fer  some  important  advantages  over  lock-based  alterna¬ 
tives.  In  order  to  achieve  the  potential  benefit  of  non- 
blocking  synchronization,  we  must  solve  certain  per¬ 
formance  problems.  Among  these  are  the  costs  of  use¬ 
less  parallelism  and  unnecessary  data  copying.  We  have 
demonstrated  the  effect  of  these  problems  experimen¬ 
tally,  and  suggested  new  protocols  to  improve  the  per¬ 
formance  of  non-blocking  implementations. 

Our  protocols  are  based  on  features  of  real  multipro¬ 
cessor  systems  that  are  not  present  in  the  theoretical 
models  for  which  non-blocking  synchronization  was  de¬ 
veloped.  In  real  multiprocessors,  events  causing  signif¬ 
icant  delays  are  visible  to  the  operating  system;  our 
protocols  take  advantage  of  this  fact  by  relying  on  the 
operating  system  to  take  corrective  action  whenever  a 
thread  is  delayed.  Operating  system  support  allows  our 
protocols  to  detect  and  react  to  delays;  this  added  flexi¬ 
bility  allows  the  design  of  more  sophisticated  protocols. 

Abstract

We  propose  efficient,  programming  language-independent,  location-transparent  references  as  a 
substitute  for  pointers  in  distributed  applications.  These  references  provide  the  semantics  of 
normal  pointers  for  both  local  and  distributed,  transient  and  persistent  objects.  They  may  be 
passed  in  messages  between  and  within  nodes  using  a  low-overhead  presentation-layer  protocol. 
The  programmer  remains  free  to  create,  delete  or  migrate  objects  at  will.  Sending  references  (or 
migrating  objects)  may  cause  references  to  be  chained  together  across  any  number  of  spaces;  we 
provide  a  short-circuit  protocol  to  optimize  access  through  such  chains. 

Integrated  with  these  references,  we  provide  efficient,  distributed,  garbage  collection  of  acyclic 
data  structures.  Even  in  the  presence  of  network  failures  such  as  lost  messages,  duplicated 
messages,  out  of  order  messages  and  site  failures,  the  correctness  of  GC  is  guaranteed.  The 
protocol  assumes  the  existence  of  local  garbage  collectors  of  the  tracing  family.  The  protocol 
combines:  (i)  local  tracing  (from  a  conservative  root);  (ii)  conservative  distributed  reference 
counting;  (iii)  periodic  tightening  of  the  counts;  and  (iv)  allowance  for  messages  in  transit  during 
GC.  The  protocol  uses  only  information  local  to  each  site,  or  exchanged  between  pairs  of  sites; 
no  global  mechanism  is  necessary.  It  is  parallel  and  should  scale  to  very  large  systems,  e.g.  tens 
of  thousands  of  nodes  connected  using  both  local  and  wide-area  networks. 


1  Introduction 

Distributed  object-based  systems  are  of  ever  increasing 
importance,  providing  a  powerful  means  of  exploiting 
contemporary  hardware.  Writing  distributed  applica¬ 
tions  is,  however,  rarely  as  straightforward  as  writing 
local  ones.  A  key  reason  is  that  local  and  remote  ob¬ 
jects  are  generally  handled  differently:  a  plain  pointer 
may  be  used  when  a  local  object  is  created;  a  special 
handle  must  be  used  for  remote  objects.  Code  must 
therefore be aware of which kind of object it manipulates,
and may need to be replicated to handle both
kinds  of  objects.  The  solution  to  this  problem  is  to  pro¬ 
vide  a  uniform  programming  model  for  both  kinds  of 
objects,  that  is,  to  provide  a  location-transparent  refer¬ 
ence  mechanism. 

We propose references as a substitute for pointers to
objects  in  distributed  applications.  References  provide 
uniform  identification  and  access  to  local  and  remote 
objects.  Methods  of  the  target  object  of  a  reference  can 
be  invoked;  references  can  be  passed  as  arguments  in 
invocations,  both  within  and  across  spaces.  Invocation 
of  a  referenced  local  object  turns  into  a  procedure  call 
through  a  pointer.  For  remote  objects,  the  invocation 
causes  the  parameters  to  be  marshalled  and  sent  to  the 
object  in  remote  procedure  call  (RPC)  messages. 

Intimately  associated  with  our  references  is  a  fault- 
tolerant  protocol  for  the  detection  of  acyclic  distributed 
garbage. 

Some programming languages provide garbage collection
(GC) to automatically deallocate inaccessible
objects.  GC  is  extremely  useful,  as  it  simplifies  the  pro¬ 
gramming  model,  therefore  freeing  valuable  program¬ 
mer  time,  while  avoiding  bugs  and  memory  leaks  which 
are  notoriously  hard  to  diagnose.  However,  there  has 
been  relatively  little  work  on  GC  in  systems  supporting 
distribution.  System  designers  often  dismiss  GC,  viewed 
as  too  complex  and/or  too  costly  or  just  not  useful  for 
general  applications.  They  are  wrong  on  all  counts. 
The  bulk  of  distributed  GC  can  be  implemented  in 
a  generic,  language-independent,  low-overhead,  fault- 
tolerant  manner. 

Our  GC  protocol  is  inexpensive  in  that  it  requires  no 
extra  foreground  messages  or  system  activity;  further¬ 
more,  neither  additional  copying  nor  additional  inter¬ 
pretation  of  message  contents  is  required.  The  mecha¬ 
nism  is  safe  in  that  as  long  as  one  reference  to  an  ob¬ 
ject  exists  somewhere  in  the  distributed  system,  the  ob¬ 
ject’s  storage  will  not  be  reclaimed.  We  minimize  the 
administrative  message  overhead  by  piggy-backing  and 
batching.  The  protocol  is  fully  parallel  and  scales  well: 
the  space  complexity  is  proportional  to  the  number  of 
remote  references;  it  communicates  only  between  pairs 
of  spaces;  and  third-party  dependencies  are  avoided.  As 
the  only  information  used  is  local,  or  exchanged  between 
pairs-of-sites,  no  global  state  or  synchronization  is  nec¬ 
essary  and  neither  multicast  nor  ordered  protocols  are 
required. 

The  underlying  assumptions  are  weak  and  reason¬ 
able.  Messages  that  arrive  are  delivered  in  finite,  non¬ 
zero  time;  they  may  be  lost,  delivered  out  of  order, 
or  duplicated.  The  network  may  become  partitioned. 
Nodes  may  crash  silently.  Clocks  need  not  be  synchro¬ 
nized.  Local  activities  are  not  synchronized  together. 

When  application  code  (the  mutator  processes),  lo¬ 
cal  garbage  collectors,  and  a  distributed  collector  all 
execute  in  parallel,  it  becomes  difficult  to  guarantee 
consistency.  Some  published  distributed  GC  algorithms 
assume  strong  consistency.  In  contrast,  our  design  is 
based  on  weakening  these  assumptions.  Strict  consis¬ 
tency  comes  at  a  high  cost;  allowing  “safe”  inconsistency 
has  the  potential  of  greater  efficiency,  reliability  and 
availability.  For  example,  our  protocol  permits  a  space 
to  infer  that  a  remote  reference  to  a  local  object  exists 
when,  in  fact,  there  is  no  such  reference.  Other  appar¬ 
ent  inconsistencies  may  arise  due  to  messages  being  in 
transit.  Inconsistencies  which  do  not  violate  the  safety 
invariants  of  GC  are  harmless.  Our  mechanism  relies 
on  this  fact,  relaxing  the  conditions  that  usually  guar¬ 
antee  liveness  whilst  maintaining  safety  invariants.  This 
relaxation  (as  opposed  to  weakening)  permits  garbage 
to  accumulate  in  the  system.  Tightening  the  conditions 
then  directly  results  in  the  reclamation  of  the  garbage. 
This process of relaxing and retightening is embodied
in a straightforward windowing algorithm based on
unsynchronised timestamps.

This  paper  presents  our  protocol  abstractly.  Sec¬ 
tion  2  introduces  our  model  and  Section  3  then  provides 
a detailed description of the algorithms and the underlying
data structures. After that, Section 4 analyzes the
complexity  of  the  protocol  in  terms  of  messages,  time 
and  space  overhead.  We  compare  this  protocol  to  sev¬ 
eral  others  in  Section  5. 

In  this  presentation  certain  key  features  of  the  mech¬ 
anism  are  highlighted  and  discussed  in  some  detail.  We 
omit  the  protocol  for  migrating  objects,  support  for  the 
deletion  of  non-garbage  objects,  and  persistence  mech¬ 
anisms;  these  are  described  elsewhere  [25].  We  do  not 
address  the  collection  of  cyclic  distributed  garbage  here; 
see  however  Section  5. 


2 Overview

The distributed universe of objects is subdivided into
disjoint spaces.¹ It is assumed that a space can be identified
unambiguously, e.g. by a UID.² A space has two
possible states: it may be operating and communicating
normally, or it may terminate. If a space terminates it
does not reappear and its name is never reused. If the
hardware it resides on crashes, a space may either persist
(recover) or terminate. A space may also appear
to cease communicating (disconnect), due to network
problems, for example, or during temporary overload
or recovery after a crash; eventually, however, such a
space either recovers (reconnects) or terminates. In the
case of a crash, our model does not specify whether the
affected space(s) recover or terminate, nor how such termination
is notified; one could postulate the existence
of an external “oracle”.

We assume that each space executes a local garbage
collector (LGC) of the tracing family.³ An LGC executes
independently of the activity of other LGCs and
of distributed cleanup.

Each space A carries a timestamp generator
stamp_A(). The timestamp generators need not be synchronized
across spaces.

Finally, each space also carries an array threshold_A
of timestamp values received from other spaces and limiting
acceptable messages.

¹ We use the abstract term “space”, rather than, for instance,
“host”, “node” or “process”, to avoid committing to a particular
implementation or lifetime.

² A higher level of distributed GC might be able to ensure that
a space name is not reused until no further references to the old
space exist; this is beyond the scope of this paper. Therefore we
assume that all space names are unique.

³ Such LGCs are standard in Lisp, Smalltalk, Eiffel and similar
environments. LGCs are also being developed which are either
language independent [S] or adapted to C and C++ [1,9]. Tracing
GC is fault tolerant, in that each execution of the collector is
independent of all previous ones.

2.1  Objects  and  References 

Spaces  contain  passive  objects  consisting  of  instance 
data  and  associated  methods.  An  object’s  instance  data 
may  include  any  number  of  references  to  other  objects. 
A  reference  is  a  location-transparent  handle,  through 
which  methods  of  the  target  object  may  be  invoked.  A 
reference  may  be  passed  as  an  argument  and  thus  copied 
between spaces. Our model is illustrated by Figure 1,
to which all the examples in this text refer.

Inter-space  references  are  identified  in  special  data 
structures  in  the  source  and  destination  spaces.  A  space 
maintains  a  table  of  stubs  for  its  outgoing  remote  refer¬ 
ences.  There  is  never  more  than  one  stub  in  a  space  for 
a  given  object,  even  if  that  space  contains  many  refer¬ 
ences  to  that  object. 

Complementing the stubs, each space maintains a table
of scions,⁴ which track incoming remote references.
Every  stub  that  points  to  a  given  space  has  a  corre¬ 
sponding  scion  in  that  space.  A  scion  may  either  point 
to  a  local  object  or  a  further  stub;  this  permits  chaining 
of  remote  references.  Scions  are  conservative  in  that  the 
stub  corresponding  to  a  scion  may  no  longer  exist.  This 
inconsistency  does  not  lead  to  errors,  but  may  temporar¬ 
ily  prevent  some  garbage  from  being  reclaimed.  Note 
that  we  use  a  distinct  scion  for  each  stub,  unlike  other 
systems  in  which  multiple  ‘outgoing’  pointers  will  merge 
into  a  single  ‘incoming’  pointer. 

A  stub  contains  a  locator  composed  of  two  parts, 
called  strong  and  weak.  Each  part  indicates  a  scion 
and  consists  of  an  identifier  of  a  space  containing  the 
scion,  and  the  scion’s  name  valid  within  that  space.  The 
strong  part  identifies  the  scion  which  matches  the  con¬ 
taining  stub.  Distributed  garbage  collection  relies  on 
the  invariant  that  (in  the  absence  of  failures)  there  is  al¬ 
ways  an  uninterrupted  chain  of  stubs’  strong  parts  and 
scions  (hereafter  called  a  “strong  chain”)  between  the 
source  and  target  of  a  remote  reference.  The  weak  part 
identifies  a  scion  which,  while  part  of  the  same  chain  of 
strong indicators, may be closer to the target object.⁵
Weak  parts  are  used  for  communication  and  location, 
which  rely  on  the  invariant  that  the  indicated  scion  will 
not  be  collected,  being  protected  by  the  strong  chain. 

Mutators  send  and  receive  messages  using  a  low- 
overhead  “presentation-layer”  protocol,  with  scions  and 
stubs  created  or  updated  automatically  as  needed.  The 
marshalled  form  of  a  reference  is  a  locator. 


⁴ Scion n. 1. A descendant or heir. 2. A detached shoot or
twig containing buds from a woody plant and used in grafting
[16].

⁵ Indeed, if the target object does not migrate, the weak part
is guaranteed to point to its space of residence.


2.2  Garbage  Collection 

In discussing garbage collection we distinguish the mutator
and collector roles [8]. Garbage collection poses
three distinct problems: distinguishing references from
other data in objects; given these references, detecting
garbage objects; and disposing of garbage objects according
to their semantics. The first and last problems
are language-dependent and are delegated to the
local garbage collectors (LGCs), as is the detection of
local garbage. Distributed detection is independent of
object structure or semantics and is performed by our
protocols. During local garbage collection of space A,
the local collector’s root set is augmented to consist of
the local roots (denoted R_A) and the local scions.

As  noted  in  the  introduction,  a  scion  may  exist  for 
which the corresponding stub no longer exists. This may
cause  the  local  garbage  collector  to  believe  that  there  is 
a  remote  reference  to  a  local  object,  whereas,  in  fact, 
none  exists.  For  this  reason,  garbage  collection  is  some¬ 
what  conservative.  The  protocols  ensure  that  eventu¬ 
ally  the  unreferenced  scion  will  be  reclaimed.  Then,  the 
objects  reachable  only  from  that  scion  become  garbage 
and  may  be  reclaimed  by  the  local  garbage  collector. 
In  our  scheme,  all  unreferenced  scions  (i.e.  unreachable 
and  not  on  a  distributed  cycle  of  garbage)  will  eventu¬ 
ally  be  reclaimed  (to  the  extent  that  the  LGC  is  itself 
exhaustive). 

3  Specification  of  the  Protocol 

We  distinguish  four  aspects  of  the  mechanism  and  high¬ 
light  novel  features  for  each.  Together,  these  offer  both 
robust  distributed  references  and  the  automated  collec¬ 
tion  of  acyclic  distributed  garbage: 

• The Transport Protocol describes the way in which
messages  are  handled  (under  what  circumstances  is 
a  message  discarded,  for  example). 

•  The  Presentation  Protocol  describes  the  way  in 
which  references  are  marshalled  into,  and  unmar¬ 
shalled  from,  messages. 

•  The  Invocation  Protocol  details  the  way  in  which 
a  reference  is  used.  That  is,  how  locating  of  the 
target  object  interacts  with  the  activity  of  invoking 
its  methods. 

•  The  Cleanup  Protocol  covers  the  elimination  of 
data  structures  associated  with  garbage  remote  ref¬ 
erences. 

When  a  space  terminates,  rather  complex  recovery 
behaviour  may  be  required  of  other  spaces.  Rather  than 
including  details  of  this  in  each  of  the  four  protocols 
listed  above,  a  separate  section  addresses  Termination 
Recovery. 
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'  ^  we&k  locator 
-  strong  locator 
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local  pointer 

Figure  1:  Object  and  reference  nrodel 
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stub 


Table 1: Scion Data Structure

source_space:  space_name   The name of the space holding the
                            matching stub.
target_object: pointer      A pointer to the object (or a stub
                            if the target is not local).
stamp:         timestamp    A locally generated timestamp to
                            protect in-transit references.

Table 2: Locator Data Structure

strong_space: space_name   Space where the next scion in the
                           chain resides.
strong_scion: scion_name   Scion’s name.
weak_space:   space_name   A space closer to where the object
                           resides.
weak_scion:   scion_name   Scion’s name.

3.1  Data  structures 

These mechanisms manipulate data structures of three
key forms: scions, locators and stubs. Suppose y is located
in space C; when used in B, a reference to y relies
on a stub y_B in B and a scion c in C. The stub contains
a locator {C, c, C, c} with strong and weak parts, both
of which, in this simple case, indicate scion c on space
C. Scion c itself holds a pointer to y.

Scions contain a space identifier (indicating the single
remote space which potentially contains the matching
stub), a locally generated timestamp (produced
when a reference which relies on this scion was last locally
marshalled), and a pointer to the object (or to a stub if
the object is remote). Scions are accessed in four modes:
(i) for invocation and location, an individual scion is accessed
directly, via a scion name included in the weak
location of a stub; (ii) by enumeration of the local scions
that are associated with a given remote space; (iii) by
enumeration of local scions that point at some particular
local object or stub; or (iv) by enumeration of all local
scions. The enumeration modes are used, respectively,
by the Cleanup Protocol, by reference marshalling, and
by the local garbage collector. The scion data structure
is documented in Table 1.

Locators are the marshalled form of references and
are  also  the  primary  components  of  stubs.  Locators  are 
documented  in  Table  2. 

Stubs contain a locator and a timestamp. Stubs are
accessed in three ways: (i) invocation proceeds directly,
through a local pointer to the stub; (ii) when a reference
is unmarshalled from a message, it is compared
against existing stubs (for uniqueness, and in case update
of the weak part is needed) by strong locator; (iii) the
Cleanup Protocol enumerates all local stubs containing
a strong part that points at a given remote space. Stubs
are documented in Table 3.

A  locator  contains  both  strong  and  weak  parts.  Un¬ 
like  most  similar  mechanisms,  both  parts  do  lead  (pos¬ 
sibly  indirectly)  to  the  target  of  the  reference,  without 
any  global  search.  There  is  always  an  uninterrupted 
chain,  embodied  in  the  strong  locator  parts,  of  stubs 
and  scions  from  primary  source  to  ultimate  destination. 
The  proof  that  the  garbage  collector  is  safe  relies  on  this 
invariant. 

Finally, each space A contains a table of received
timestamps, threshold_A, indexed by space identifier, explained
in the next section.

Table 3: Stub Data Structure

location: locator      Locator for the target object.
stamp:    timestamp    Timestamp to guard against race
                       conditions.
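
The three structures of Tables 1 to 3 can be written down directly. The sketch below is only an illustration in Python: the field names follow the tables, while the integer timestamps, the string space names, the `scion_tables` dictionary and the helper `follow_strong_chain` are assumptions introduced for the example. It builds the configuration for y described in the text (stub y_A in A, scion b and stub y_B in B, scion c in C) and walks the strong chain to the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    strong_space: str  # space where the next scion in the chain resides
    strong_scion: str  # that scion's name
    weak_space: str    # a space closer to where the object resides
    weak_scion: str    # that scion's name


@dataclass
class Stub:
    location: Locator
    stamp: int


@dataclass
class Scion:
    source_space: str      # space holding the matching stub
    target_object: object  # the local object, or a Stub if it is remote
    stamp: int


def follow_strong_chain(stub, scion_tables):
    """Follow strong locator parts from a stub to the target object."""
    target = stub
    while isinstance(target, Stub):
        loc = target.location
        target = scion_tables[loc.strong_space][loc.strong_scion].target_object
    return target


# Hypothetical instance of the example for object y located in space C.
y = "object y"
y_B = Stub(Locator("C", "c", "C", "c"), stamp=1)
y_A = Stub(Locator("B", "b", "C", "c"), stamp=1)
scion_tables = {
    "C": {"c": Scion("B", y, stamp=1)},
    "B": {"b": Scion("A", y_B, stamp=1)},
}
```

Here `follow_strong_chain(y_A, scion_tables)` passes through B before reaching y, whereas the weak part of y_A already names scion c in C.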

3.2  Transport  Protocol 

Communication between mutators in different spaces
occurs via messages that are timestamped using the
timestamp generator, stamp_B(), of the message’s source
B.

At the destination A, the message timestamp is compared
with the B timestamp held in threshold_A[B]: only
messages containing timestamps generated later than
the threshold are accepted. This eliminates a race condition
explained in Section 3.5.1.
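
The acceptance test can be stated in a few lines. The following Python sketch assumes that timestamps are comparable numbers and that a source with no recorded threshold is accepted unconditionally; the function name `accept` and the dictionary representation of the threshold table are hypothetical.

```python
def accept(threshold, source, stamp):
    """Return True if a message from `source` stamped `stamp` may be delivered."""
    # Assumption: a source with no recorded threshold accepts everything.
    return stamp > threshold.get(source, float("-inf"))


# Threshold table of space A, indexed by source space.
threshold_A = {"B": 10}
late = accept(threshold_A, "B", 10)     # not later than the threshold
fresh = accept(threshold_A, "B", 11)    # generated later than the threshold
unknown = accept(threshold_A, "D", 3)   # no threshold recorded for D
```

The comparison is strict because only timestamps generated later than the threshold are accepted.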

Our  approach  usually  permits  messages  to  be  pro¬ 
cessed  out-of-order.  Whenever  message  ordering  could 
be  critical,  the  required  synchronization  is  explicitly 
supported  using  either  call-response  message  pairs  or 
the  threshold  table.  Some  delayed  messages  might  be 
treated  as  lost,  but  only  in  circumstances  in  which  it  is 
possible  that  acting  upon  them  would  violate  the  GC 
invariants. 

As  our  mechanisms  are  designed  to  tolerate  message 
loss,  reordering  and  duplication,  it  is  acceptable  for  the 
message-passing  protocol  to  use  cheap,  unreliable  trans¬ 
port  protocols.  If  an  application  uses  a  more  reliable 
protocol,  this  causes  no  difficulties  as  the  necessary  con¬ 
ditions  for  this  scheme  are  retained.  The  background 
messages required by the cleanup protocol may also use
an unreliable protocol, even when the application itself
requires a more reliable mechanism.

3.3  Presentation  Protocol 

A  side-effect  of  marshalling  a  reference  into  a  message  is 
to  produce  a  scion.  There  can  only  be  a  single  scion  per 
target  object  and  per  referring  space.  The  marshalling 
code  first  searches  for  such  a  scion;  if  none  exists  it  is 
created;  it  is  timestamped  with  the  current  timestamp. 

The  marshalled  form  of  a  reference  is  a  locator,  the 
strong  part  of  which  names  a  scion  in  the  sender’s  space. 
The  weak  part  names  a  scion  closer  to  the  object  (but 
on  the  same  strong  chain),  if  the  sender  knows  one;  oth¬ 
erwise  it  is  equal  to  the  strong  part.  These  parts  both 
permit  the  corresponding  remote  scion  to  be  unambigu¬ 
ously  identified.  The  message  is  timestamped  with  the 
same  timestamp  as  used  to  create  the  scion(s)  it  refers 
to. 


At the receiver, unmarshalling the reference produces
a pointer to a single local stub per matching scion. The
actions are: search for a stub with the same strong locator;
if one exists and its timestamp is less than the
message’s, then copy the weak part and the timestamp
from the message (if none exists, create one from the
weak part and the timestamp in the message); pass up
the address of this stub to the application.
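
The marshalling and unmarshalling rules above can be illustrated with a small model. This Python sketch is written under several assumptions that are not in the text: timestamps come from a local counter, scion names are generated as `s0`, `s1`, and so on, and scions, stubs and messages are plain dictionaries. The class `Space` and its methods are hypothetical names.

```python
import itertools


class Space:
    def __init__(self, name):
        self.name = name
        self._clock = itertools.count(1)  # local, unsynchronised generator
        self.scions = {}  # scion name -> {source_space, target, stamp}
        self.stubs = {}   # strong part (space, scion) -> {weak, stamp}

    def _find_scion(self, target, dest):
        # At most one scion per target object and per referring space.
        for name, scion in self.scions.items():
            if scion["target"] is target and scion["source_space"] == dest:
                return name
        return None

    def marshal(self, target, dest, weak=None):
        """Marshal a reference to `target` into a message bound for `dest`."""
        stamp = next(self._clock)
        name = self._find_scion(target, dest)
        if name is None:
            name = f"s{len(self.scions)}"
            self.scions[name] = {"source_space": dest, "target": target}
        self.scions[name]["stamp"] = stamp  # stamped on every marshalling
        strong = (self.name, name)
        # The weak part equals the strong part unless a closer scion is known.
        return {"strong": strong, "weak": weak or strong, "stamp": stamp}

    def unmarshal(self, msg):
        """Return the single local stub matching the message's strong part."""
        stub = self.stubs.get(msg["strong"])
        if stub is None:
            stub = {"weak": msg["weak"], "stamp": msg["stamp"]}
            self.stubs[msg["strong"]] = stub
        elif stub["stamp"] < msg["stamp"]:
            stub["weak"], stub["stamp"] = msg["weak"], msg["stamp"]
        return stub
```

Marshalling the same reference twice for the same destination reuses the scion, and unmarshalling the two messages in either order yields one stub carrying the later timestamp, which is the idempotence property discussed in the text.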

Since  marshalling  and  unmarshalling  are  necessary 
for  remote  communication,  our  approach  adds  negligible 
overhead  other  than  creation  of  stubs  and  scions,  while 
ensuring  no  duplicates.  Care  is  taken  to  ensure  unique¬ 
ness  of  the  stub-scion  pair  referring  to  a  particular  ob¬ 
ject  between  two  spaces.  This  has  a  small  associated 
cost  (discussed  in  Section  4.2.2),  because  it  requires  ad¬ 
ditional  indexing  mechanisms  and  searches,  but  it  ren¬ 
ders  scion  and  stub  creation  idempotent.  Hence,  dele¬ 
tion  of  a  stub  permits  the  corresponding  scion  to  be 
discarded  without  fear  that  another  stub  may  depend 
on  that  scion. 

The  construction  of  both  stubs  and  scions  is  conser¬ 
vative.  The  endpoints  of  the  remote  reference  are  cre¬ 
ated  without  knowing  whether  they  will  be  useful,  and 
stubs  are  always  created  after  their  matching  scions. 
For  instance,  it  may  occur  that  a  message  containing  a 
reference  is  lost;  in  this  case,  a  scion  has  been  created 
without  a  corresponding  stub.  It  may  also  occur  that  a 
received  reference  is  actually  ignored  by  the  mutator;  in 
this case the whole reference chain is useless. The stub
and scion code can only add new stubs and create more
scions.  This  is  consistent  with  the  view  that  mutators 
only  allocate  objects,  whereas  deallocation  is  performed 
transparently  by  the  collector.  Here  the  LGCs  remove 
unreferenced  stubs,  and  the  cleanup  protocol  removes 
unreferenced  scions. 

3.4  Invocations  and  Short-Circuiting 
Indirect  References 

A  reference  is  typically  used  to  invoke  some  procedure 
(or  method)  of  the  target  object.  Remote  invocation 
uses a call-response protocol.⁶

3.4.1  Indirect  References 

Liberal  use  of  indirect  reference  chains  allows  a  reference 
to  be  passed  cheaply  in  messages,  while  retaining  use¬ 
ful  invariants  (i.e.  the  guarantee  of  reachability).  But, 
when  considering  communication,  they  are  harmful;  not 
only  because  of  the  overhead  and  poor  locality,  but  more 
fundamentally  because  sending  a  reference  along  an  in¬ 
direct  chain  creates  yet  another  indirect  chain. 


⁶ Our initial specification [25] was based on one-way messages.
A  call-response  protocol  considerably  simplifies  short-circuiting 
indirect  references. 

Consider, for example, an invocation on y, made by
x, passing a reference to y itself as an argument. On
receipt of the invocation at C, assuming the strong locator
chain was used, the argument will be indicated
by a chain of stub-scion pairs starting in C and running
through B, A, B again and back to C. If this same reference
is returned as a result, matters deteriorate further.

3.4.2  Short-Circuiting  an  Indirect  Reference 

For  these  reasons,  the  weak  locator  chain  is  used 
for invocations⁷ and the strong chain is lazily short-
circuited  in  a  safe  fashion  as  a  side-effect  of  such  invoca¬ 
tions.  Obsolete  indirect  stubs  and  scions  will  be  cleaned 
up  later  by  the  garbage  collector.  Furthermore,  an  in¬ 
direct  chain  is  short-circuited,  at  the  latest  before  the 
harm  indicated  above  may  occur,  i.e.  before  allowing  ex¬ 
ecution  of  an  invocation  carrying  reference  arguments. 

There  are  two  sub-cases  to  consider.  The  easiest  case 
is  when  the  caller’s  weak  locator  is  exact,  indicating  the 
scion  closest  to  the  target.  The  other  case  is  when  the 
harmful  effect  indicated  above  could  occur  (the  weak 
locator  is  inexact,  and  at  least  one  argument  is  a  refer¬ 
ence).  The  intermediate  case  (inexact  weak  locator  but 
no  reference  argument)  can  be  treated  in  either  way. 

3.4.3  Weak  Locator  Exact 

In Figure 1, suppose x calls y.f(), i.e. invokes some
method of y with no argument. The call message is
sent to weak location {C, c}. Upon receipt of a call, at
scion c, from a space A other than the space containing
the matching stub (B), a new scion c'' (which does not
appear in the figure) is created. Locator {C, c'', C, c''}
is piggybacked on the invocation results, and the stub
y_A at the invoker’s space A, through which the invocation
was made, is updated to locator {C, c'', C, c''}. As
the original chain, y_A → {B, b} → y_B → {C, c} → y,
remains in place until the y_A stub contents are changed,
the GC invariants are maintained. The superseded indirect
chain (the scion {B, b} and, possibly,⁸ the stub
y_B and the scion {C, c}) becomes garbage and will be
collected at some later time.
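
The callee-side decision in this case can be sketched as below. This is an illustration in Python only: the function name `receive_call`, the dictionary layout of the scion table and the rule for naming the new scion (appending two primes to the old name) are assumptions made for the example, and timestamps are left out for brevity.

```python
def receive_call(space_name, scions, scion_name, caller):
    """Return the locator to piggyback on the reply, or None if none is needed."""
    scion = scions[scion_name]
    if scion["source_space"] == caller:
        # The call came from the space holding the matching stub:
        # the reference is already direct.
        return None
    # The call arrived through a weak locator from another space. Allocate
    # (or reuse) a scion dedicated to the caller and pointing at the same target.
    new_name = scion_name + "''"
    scions.setdefault(
        new_name, {"source_space": caller, "target": scion["target"]}
    )
    return (space_name, new_name, space_name, new_name)


# Space C holds scion c for the stub in B; x in space A calls through {C, c}.
scions_C = {"c": {"source_space": "B", "target": "y"}}
stub_yA = {"locator": ("B", "b", "C", "c")}
new_locator = receive_call("C", scions_C, "c", caller="A")
if new_locator is not None:
    stub_yA["locator"] = new_locator  # the caller updates its stub on reply
```

The old scion c is left untouched, so the original strong chain stays intact until the caller's stub has actually been rewritten.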

3.4.4  Weak  Locator  Inexact 

Consider now x invoking t.g(p), some method g of t with
a reference argument (for which a scion a is allocated).
Since the weak locator indicates C as above, the call
message is similar.

Scion c', receiving this message, detects that t is not
local because it points to a stub. The message is passed
(without unmarshalling) to t_C, which forwards it on to
its own weak locator, i.e. {D, d}. Here we notice that
the target is local but that the caller’s weak locator was
inexact and the argument is a reference. Therefore the
call is aborted and a location_exception is signalled back
to A with an up-to-date location. For this, a new scion
{D, d'}, pointing to t, is allocated for use by A.

Upon receiving the exception, t_A updates its locator
to point to {D, d'}. It then retries the invocation (a new
scion a' pointing to p on behalf of D must be allocated).
Now scion {A, a} can be collected, as well as the old
indirection chain b' → t_B → c' → t_C → d.

⁷ If objects do not migrate, the weak locator is guaranteed to
indicate the space containing the target object.

⁸ Depending on whether they also form part of other, extant,
references at B.

3.5  Collector  Protocol 

Above we have specified the mutator protocol. Now we
specify the collector, i.e. the actions performed
independently of the mutator’s execution in order to collect
garbage. This involves two independent activities: local
garbage collection and the distributed cleanup protocol.
Reclamation of unreachable stubs and scions is tricky
because of the possibility of lost messages and of race
conditions.

3.5.1  Local  Garbage  Collection 

The LGC traces references from the local root and the
set of all local scions. An unreachable stub is garbage
and can be collected. However, there is a possible race
condition with late-arriving messages that contain the
same reference.

The race condition is eliminated by the following
rule. Before discarding a stub in A whose strong locator
points to space B, threshold_A[B] is increased to
the value of the stub’s timestamp, causing the transport
protocol at A to drop earlier messages from B.
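
The rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and method names (Space, Stub, discard_stub, accept) are hypothetical, timestamps are modelled as plain integers, and we assume that a message stamped at or below the threshold counts as "earlier" and is dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Stub:
    target_space: str  # space named by the stub's strong locator
    timestamp: int     # stamp recorded in the stub


@dataclass
class Space:
    name: str
    # threshold[B]: messages from B stamped at or below this value are dropped
    # (the "at or below" boundary is an assumption of this sketch)
    threshold: dict = field(default_factory=dict)

    def discard_stub(self, stub: Stub) -> None:
        """Raise the threshold before the LGC reclaims an unreachable stub."""
        current = self.threshold.get(stub.target_space, 0)
        self.threshold[stub.target_space] = max(current, stub.timestamp)

    def accept(self, sender: str, stamp: int) -> bool:
        """Transport-level filter: reject messages that arrive too late."""
        return stamp > self.threshold.get(sender, 0)
```

Taking the maximum keeps the threshold monotonic, so discarding stubs in any order never re-opens the window for an older message.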

3.5.2  Cleanup  Protocol 

As was noted in the introduction, the mutator protocol
relaxes the GC liveness condition, and the cleanup protocol
occasionally strengthens it: the strongest form is
that every scion has a single matching stub; the weakened
form permits some scions to have no matching
stub.

Signaling to a scion that the matching stub has
been collected is complicated by two potential problems.
First, the “deletion” message could be lost. To tolerate
message loss, lists of stubs which were still live at some
time are sent, rather than deletion messages.
Second, messages are asynchronous, leading to possible
race conditions between scion update and deletion. To
avoid the race condition, a scion is removed only if it
is unreachable and there is no message in transit
which may make it reachable again. (This race condition
is different from the one in the previous section.)

Thus a space A will periodically send to some other
space B a message (called a live message) containing: (i)
the list of scion names, taken from the strong locators
of all extant stubs at A that point at scions at B, and
(ii) the value threshold_A[B].

This  permits  the  receiver  B  to  deduce  what  scions 
in  B  are  unreachable:  precisely  those  for  which  there 
is  no  matching  stub.  An  unreachable  B  scion  can  be 
removed,  if  and  only  if  the  following  condition  holds  for 
it  at  B: 

scion.stamp < message.threshold

i.e.  there  are  no  recent  messages  in  transit  carrying  its 
location,  which  could  make  it  reachable  again. 
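
As an illustration of how a receiver might apply this condition, the sketch below processes one live message at space B. It is a simplification under stated assumptions: the names Scion, LiveMessage and process_live_message are hypothetical, timestamps are integers, and each scion is assumed to record the single source space whose stub should match it.

```python
from dataclasses import dataclass


@dataclass
class Scion:
    name: str
    source_space: str  # space whose stub is expected to match this scion
    stamp: int         # timestamp currently held by the scion


@dataclass
class LiveMessage:
    sender: str
    live_scions: frozenset  # scion names taken from the sender's extant stubs
    threshold: int          # the sender's threshold entry for the receiver


def process_live_message(scions: dict, msg: LiveMessage) -> list:
    """Remove the scions that msg shows to be unreachable; return their names."""
    removed = []
    for name, scion in list(scions.items()):
        if scion.source_space != msg.sender:
            continue  # the message says nothing about other spaces' stubs
        if name in msg.live_scions:
            continue  # a matching stub still exists at the sender
        if scion.stamp < msg.threshold:
            # no recent message in transit can make this scion reachable again
            del scions[name]
            removed.append(name)
    return removed
```

Because the message lists what is still live rather than what was deleted, processing the same message twice, or losing it and receiving a later one, leaves the scion table in a consistent state.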

Essentially, we have made stub and scion deletion
an idempotent operation and eliminated the race
conditions by ignoring messages arriving “too late” and
by never discarding scions that have recently updated
timestamps. Scion timestamps protect against deletion
of scions for which a usable reference may be in transit,
whereas stub timestamps protect against re-creation of
stubs for which scions have been discarded.

To  ensure  that  progress  is  made,  one  space  may 
prompt  another  to  report  on  the  stubs  it  holds.  A  space 
B  may  send  a  background  prompt  message  to  another 
space  A.  On  receipt  of  such  a  message  the  live  message 
is  sent  and  processed  as  indicated  above. 

3.6  Termination  Recovery 

In this section we address the problems arising when a
space terminates. We define space termination
to mean that its local root is deleted, as well as all
objects it contains. (References to the deleted objects
are detectably dangling; an attempt to invoke the target
will raise an exception.) Indirection chains through
the terminating space must first be resolved and
short-circuited. These rules are easy to enforce when a
space voluntarily terminates itself; in other cases a
protocol is needed to achieve the same observable effect.
There are two aspects: ensuring communication, and
re-establishing the invariants.

3.6.1  Recovering  Communication

Invocation  uses  the  weak  locator  of  the  sender’s  stub, 
and  therefore  is  not  impaired  by  a  break  in  the  strong 
chain  only.  In  fact,  if  the  target  does  not  migrate,  the 
weak  locator  holds  its  actual  location. 

If objects do migrate, then the weak locator may
point to an intermediate scion, such as tA to c' in
Figure 1. If the weakly located scion is lost, this also breaks
the strong locator chain. Here the only possibility is
to search exhaustively for the object. Such a search
is expensive and may be prohibitive in large systems,
so a structuring of the system into collections of spaces
with intervening non-terminating gateways would be
required.

Furthermore,  global  search  assumes  that  the  holder 
of  a  stub  knows  some  unique  feature  of  the  sought-after 
object.  Since  we  don’t  assume  UIDs,  the  unique  feature 
will  be  the  weak  locator  part.  Thus  when  an  object 
migrates,  its  new  scion  must  carry  with  it  the  list  of 
scion  names  under  which  it  had  been  previously  known. 
This  list  will  be  discarded  as  a  side-effect  of  discarding 
the  scion  in  the  short-circuit  protocol  of  Section  3.4.2. 

3.6.2  Re-Establishing  the  Invariants 

Initially  it  might  appear  that  termination  of  a  space  that 
contains  a  stub  is  of  little  consequence.  Unfortunately, 
it  is  an  error  to  simply  discard  the  matching  scion. 

The safety of garbage collection depends on the
invariant that an uninterrupted strong chain exists
between the source and target of a reference. In turn, the
invocation protocol depends on the fact that the scion
pointed to by a weak locator will not be collected.

We are contemplating a number of possible solutions.
The simplest is to retain forever all scions whose
source-space has not reliably short-circuited indirections
through it.

Our  preferred  solution  improves  over  the  above,  by 
relying  on  the  existence  of  a  global  garbage  collector 
(which  is  necessary  anyway  to  collect  distributed  cycles 
of  garbage,  since  they  are  not  removed  by  the  protocol 
presented  in  this  paper)  to  detect  and  remove  garbage 
scions  retained  in  this  way.  This  is  the  approach  taken 
by  Dickman  [7]  and  means  that  the  algorithm  is  no 
longer  live,  but  remains  efficient  and  effective. 

An alternative would be to apply a rule, similar to
the rule for objects, to broken chains: any reference
chain indirecting through a terminated space is deemed
dangling. Such an approach is only correct, however, if
scion identifiers are never reused, as otherwise a different
problem of erroneous chain following is introduced.
It has the drawback that an object may become
unreachable by some path, which happened to go through
a terminated space, and remain reachable by others:
such inconsistencies are undesirable.

Yet another solution maintains liveness and avoids
further errors, but is expensive, involving a large-scale
search. On discovering such a stub-less scion, a message
can be passed down the remaining chain to the object
concerned*. The message has inserted in it the space
and scion identifiers for every scion encountered during
the message’s journey. Having thus collected a list of all
scions that may be indicated by detached sections of the
chain, an exhaustive search is performed. Each space in
turn is presented with the list of intermediate scions
and requested to update any and all relevant locators.
Again the cost of global searches should be limited by
a hierarchic structuring of spaces.

* If the chain is broken in two places the mid-section will be
recovered first and this will then recover the association with the
most detached part.

Whatever solution is chosen, the window of vulnerability
can be narrowed by aggressively short-circuiting
indirections. When a strong and weak locator disagree,
a dummy invocation is made, thereby updating the
locators. This immediately solves the problem if objects
cannot migrate, at a cost in additional messages. The
dummy invocation can be performed either immediately
upon receiving a reference, or after a time-out, or by a
background daemon.

4  Analysis 

A proof of the safety of the algorithm, and a discussion
of the circumstances under which it exhibits liveness,
are in preparation. Essentially, it is shown that the
algorithm can never collect non-garbage objects since the
scions form a superset of the existing stubs and act as
local roots for the LGC. The timestamp windows, as
represented by the local threshold vectors, are
fundamental to the proof, given the possibility of messages
being in transit or duplicated. Furthermore, for example,
if disconnections do not occur after some initial
interval, it is shown that all acyclic garbage is eventually
collected by the algorithm (note that this is stronger
than claiming that the algorithm is live in the absence
of disconnections).

Our references require no foreground messages other
than the ones sent by the application. Local processing
and memory costs appear acceptable. In addition to
tolerating non-byzantine failures, the protocols described
do not require any form of global synchronization or
snapshot, nor are third parties depended upon in any
way. As all of the costs incurred are associated with
the handling of references, and the background activities
need only commence once a reference is used, no
overhead is imposed on applications which choose not
to use our mechanism.

4.1  Failures 

The failure model used in this work is slightly richer
than in most comparable material. Messages may be
duplicated as well as lost or delivered out of order.
Processor pairs are subject to periods during which
communication between them may be impossible; however, it
is not assumed that this failure is either symmetric or
transitive. Processors are fail-stop. It is assumed that
messages are not undetectably corrupted and that each
timestamp generator produces increasing values.


A particular emphasis has been placed on message-related
failures in this presentation, as they are most
naturally and coherently integrated into our protocols.
The support for recovery when some space terminates
is more complex and was presented separately. Overall,
these mechanisms permit the collection of acyclic
garbage in an environment that is rather more demanding
than those postulated by most other approaches.

4.2  Costs 

The costs of the protocol are considered according to
three different measures: messages, CPU
time and memory space. To provide a baseline against
which comparisons can be made, consider a state-of-the-art
implementation, based on UIDs or capabilities,
supporting network transparency, but not garbage
collection. This system would support messages and
timestamps. Data structures would require marshalling and
unmarshalling.

This  analysis  omits  the  cost  of  recovery  after  a  crash. 

4.2.1  Messages 

An important feature of this algorithm is that it
requires no additional foreground messages. Additional
messages do, however, arise in the background as a
consequence of the cleanup protocol.

The marshalled form of a reference, as held in
messages, consists of a locator. If stock hardware is used, a
marshalled reference is therefore around 16 bytes long.
This is comparable to the size of UIDs in many systems.
In both our system and the minimal system, messages
are timestamped.

Since a UID is location-independent, locating its
target entails a distributed search algorithm. In the worst
case, a reliable global search is needed. Maintaining a
location cache for recently-used UIDs allows the cost of
the search to be amortized. There is no such cost with
locators.

4.2.2  Local  CPU  Time 

Reference marshalling and unmarshalling require
searches for existing scions and stubs, prior to creating
new ones. A similar cost arises when short-circuiting
indirect references. Passing a UID is typically much
simpler, involving a simple copy into the message.

A UID system bears a cost searching through its
cache for the location of a message’s destination. The
processing involved is somewhat simpler than our
marshalling and the cost is borne only once per message.

Finally, we use additional CPU time in executing the
local garbage collector, and in interpreting the live
messages of the cleanup protocol. We perform these
activities in the background, however, so the impact on
application performance is minimal.

All  the  costs  listed  above  are  local.  A  space  never 
needs  to  wait  for  another  any  more  than  required  by 
the  mutator. 

4.2.3  Memory 

The per-space memory costs of this approach depend on
the degree of locality exhibited by the applications
executed in the system. Three major data structure types
are required: the threshold vector, stubs and scions.

The following estimates of memory usage can be
made, assuming for simplicity of analysis that
timestamps, space-identifiers, scion names and local pointers
each occupy four bytes, and that all indirection chains
and garbage have been eliminated:

•  The  threshold  vector  requires  one  entry,  of  8  bytes, 
for  each  known  remote  space. 

•  There  is  a  single  stub  for  each  remote  object  that 
is  locally  referenced,  requiring  24  bytes. 

•  There is a single scion per object for each remote
space that contains references to that object; it
occupies 20 bytes.

In addition, hash tables or the like are required to
implement the access modes listed in Section 3.1.
These costs are not unreasonable: a maximum of 8 + 24 +
20 = 52 bytes per remote reference, across the system
as a whole, compares unfavourably, but not appallingly,
with the cost of 16-byte UIDs supported by a location
cache. Some hotspots may arise, however, if particular
well-known objects are referenced from a great many
remote spaces, due to the accumulation of scions.
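
The arithmetic above can be reproduced with a small helper. This is only a back-of-the-envelope sketch: the function name space_memory is hypothetical, the three sizes are the ones quoted in the list above, and the hash tables mentioned in the text are deliberately left out.

```python
# Per-item sizes quoted in the text, in bytes.
THRESHOLD_ENTRY = 8
STUB = 24
SCION = 20


def space_memory(remote_spaces: int, stubs: int, scions: int) -> int:
    """Bytes used by the threshold vector, stubs and scions of one space."""
    return remote_spaces * THRESHOLD_ENTRY + stubs * STUB + scions * SCION


# Worst case counted in the text: one threshold entry, one stub and one
# scion attributed to a single remote reference.
per_reference_max = space_memory(1, 1, 1)
```

For a space that knows 10 remote spaces, holds 100 stubs and exports 50 scions, the same helper gives 3480 bytes, which shows that stubs and scions dominate the threshold vector.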

4.3  Measured  Performance 

We have prototyped an earlier version of our protocol,
called SGP [25], on the distributed Lisp Transpive [18].
This version lacks weak locators, uses a message
protocol rather than call-reply and uses an extra timestamp
vector instead of timestamping stubs. A detailed
account and analysis of this experiment may be found in
[19].

For our evaluation, we replaced Piquer’s original
distributed Indirect Reference Count (IRC) collector. Our
protocol provides the same functionality as IRC, and is
furthermore scalable and resilient to message and space
failures.

In this section, we compare the measured performance
of our prototype with IRC, in terms of communication
and CPU overhead. Our measurements of
two applications (merge sort and matrix multiplication)
were taken on a Parsytec board composed of four T800
Transputers with one megabyte of memory each, hosted
in a Sun. Each application is timed twice in a row; the
figures are better the second time because of Transpive’s
caching policy. The measurements, repeated dozens of
times, have shown extremely low variance. Our experiments
were able to test resilience to message loss but
not to termination, due to lack of a fault-tolerant
application. Furthermore, we were not able to quantify how
conservative or how scalable our protocol is.

Table 5: Message Overhead (each cell lists the two consecutive runs)

                  Control Messages
Application       IRC           SGP          IRC - SGP
(sort 100)         31   28       10    8      21   20
(sort 200)         41   39       10    8      31   31
(mult 20 20)      101   96       20   18      81   78

Table 4 shows local execution times. The overhead
is due to management of (the Transpive equivalent of)
stubs and scions. Our implementation is on average
10% slower than IRC and 20% slower than with
distributed collection turned off. This result is encouraging:
our implementation is not optimized and retained
some obsolete data structures and processing from
Piquer’s implementation. Furthermore, our protocol does
more than IRC.

Table 5 measures message overhead. IRC sends
“delete” messages, whereas we periodically send live
messages. This buffering dramatically reduces the number
of control messages.

Although our object model does not take replication
into account, it was necessary for Transpive; it proved
quite easy to add. But Lisp’s extremely fine granularity
of objects is very demanding, requiring a huge number of
stubs and scions which consume a lot of space, increasing
the garbage collection overhead.

5  Related  Work 

This  section  compares  our  proposal  with  related  work, 
in  the  two  areas  of  location-independent  references  and 
distributed  garbage  collection. 

5.1  Location-Independent  References 

Many distributed systems [17, 21, 24] rely on
fixed-length, location-independent Universal IDentifiers
(UIDs) to designate and locate objects throughout the
network. UIDs do not scale well. Uniqueness can be
guaranteed only within some domain; cross-domain
references require a separate mechanism. A UID does not
carry location information; locating its target entails a
global search in the general case. Furthermore, UIDs
are not pointers, forcing programmers to use two very
different mechanisms.

Table 4: Execution Times (each cell lists the two consecutive runs)

                  CPU time in seconds                        Overhead
Application       No DGC         IRC            SGP          SGP/IRC
(sort 100)         3.8   3.2      4.7   3.9      5.5   4.1   17%     5%
(sort 200)         5.6   4.4      6.7   5.2      8.1   5.9   20%     12%
(mult 20 20)      11.1   7.8     12.1   8.7     13.5   9.8   11.9%   12.3%

Our reference mechanism owes much to the links of
Demos/MP [20] and the forwarders of Emerald and
Hermes [3, 4]. In contrast to their proposals, our references
are intimately associated with GC. This is made
possible by the invariants maintained by our protocol.

Fowler [10] proposes chaining forwarders to provide
continuous access to highly mobile objects. Fowler
analyzes three alternative location protocols (distinguished
in how they short-circuit indirection chains),
demonstrating that the cost decreases dramatically when the
number of accesses increases faster than the number of
moves. His Jacc protocol bears some similarities with
ours. Our stubs carry more information than Fowler’s
forwarders; for example, our weak locator accesses a
non-mobile object in a single hop, whereas a forwarder is,
in effect, a strong locator. In Fowler’s design, a highly
mobile object may inform others of its current location,
requiring something similar to the source space
information in our scions.

5.2  Distributed  Garbage  Collection 

One important problem of distributed garbage
collection is maintaining the consistency of scions with stubs
in the face of failures. A common approach is to use
reliable mechanisms to enforce strong consistency, which
is expensive. A more recent approach is to relax
traditional GC invariants.

Our  scheme  is  based  on  the  latter  alternative  and 
bears  similarities  to  some  proposals  based  on  reference 
counting  [6,  18].  Unlike  those  approaches,  however,  a 
scion  is  maintained  per  source  space,  which  permits  us 
to  tolerate  message  loss  while  avoiding  the  dangers  of 
duplicated  delete  messages. 

Dickman [6] proposes an optimised weighted reference
counting (oWRC) algorithm. In order to deal with
unreliable communication protocols, oWRC preserves a
weak invariant enforcing that each object weight (total
weight) is always greater than or equal to the sum of all
remote reference weights (partial weight). The use of a
weak invariant allows the algorithm to tolerate message
loss but duplicated messages remain problematic.

Mancini and Shrivastava [15] present an efficient and
fault-tolerant reference-counting distributed garbage
collector. A reliable RPC mechanism, extended to
detect and kill orphans, provides resilience to failures. A
special protocol copes with duplication of remote
references, by making an early short-cut of potential
indirections even if they are never used.

In the future we expect to add to our protocol a
separate mechanism to deal with distributed cycles of
garbage, which are not currently handled. There are
several proposals in the literature, e.g. Bishop’s
migration technique [2] or Schelvis’ cycle-detection technique
[22]. We will discuss below Liskov’s logically centralized
algorithm, Hughes’ timestamp algorithm, and Lang’s
dynamic grouping technique.

Lang et al. [13] propose to combine a distributed
reference count with the dynamic grouping of nodes with
distributed mark-and-sweep within each group. The
reference counts must be accurate, hence message failures
are not tolerated. The mark-and-sweep algorithm relies
heavily on termination protocols, which are not scalable.
A distributed garbage cycle that crosses group
boundaries is not collected until another group is formed,
enclosing the whole cycle; therefore liveness is not
guaranteed. A failure during a collection causes group
reorganisation excluding the failed node, restarting the
group GC.

Hughes [11] uses a global clock. A collector, starting
from some local root at time t, marks all objects it
reaches with the value t. The marking on a reachable
object will advance periodically; on an unreachable
object the mark will not change. Objects marked with a
date less than some global minimum are collected.
Determining the minimum requires repeated execution of
a global termination algorithm. Furthermore, if even
a single processor is disconnected, it is impossible to
advance the minimum.

Liskov and Ladin [14] describe a fault-tolerant
distributed garbage detector based on their highly
available logically-centralised service. Each local
collector informs the centralised service of incoming and
outgoing references, and about the paths between
incoming and outgoing references. The path computation is
expensive but necessary for reclamation of distributed
garbage cycles. Based on the paths transmitted, the
centralised service builds the graph of inter-site
references, and detects garbage (including dead cycles) with
a standard tracing algorithm. The centralised service
informs LGCs of accessibility of objects.

In a later paper [12] Ladin and Liskov simplify, and
correct the deficiencies of, the above proposal,
adopting Hughes’ algorithm and loosely synchronised local
clocks. Hughes’ algorithm eliminates inter-space cycles
of garbage, thereby eliminating the need for an accurate
computation of the paths and for the central service
to maintain an image of the global references.
Furthermore, the centralized service determines the garbage
threshold date, making a termination protocol
unnecessary.

6  Conclusion 

We have presented scalable location-transparent
references to objects in a distributed system, with
well-defined failure semantics. Integrated into the approach
is fault-tolerant automatic collection of acyclic
distributed garbage, which can be combined with any
reasonable local garbage collection algorithm. The
mechanism is effective, inexpensive and straightforward, and is
based on a novel combination of well-known techniques.
Our mechanism requires no global search or
synchronization and uses a very cheap transport protocol (not
requiring multicast communications or any particular
ordering on messages). The key enabling concepts are
the scions, i.e. inverse reference lists (as opposed to
reference counts), and the use of timestamps and
windowing protocols to support idempotent deletion. In
conjunction with the conservative creation policy, these
provide a fault-tolerant and efficient mechanism.

The current specification suffers from some
limitations. First, only acyclic garbage is collected; it will
be necessary to extend the mechanisms to collect
distributed cyclic garbage. Second, recovery from space
termination is incompletely specified. Third, although
the main-line protocol is scalable, the recovery
protocol entails global search; to limit the cost of search, we
pointed at the need to structure the universe into small
partitions (in which exhaustive search remains realistic)
connected by gateways, but this needs more work.

A first version of the garbage collection protocol has
been prototyped; its measured performance is similar to
that of an existing, non-fault-tolerant, non-scalable
distributed collector. We are currently in the process of
implementing the specifications of this paper, as a
system-level facility in the Soul object-support layer [23].


The Weakest Failure Detector for Solving Consensus*

Abstract

We determine what information about failures
is necessary and sufficient to solve Consensus in
asynchronous distributed systems subject to crash
failures. In [CT91], we proved that ◇W, a failure
detector that provides surprisingly little information
about which processes have crashed, is sufficient
to solve Consensus in asynchronous systems
with a majority of correct processes. In this paper,
we prove that to solve Consensus, any failure
detector has to provide at least as much information
as ◇W. Thus, ◇W is indeed the weakest failure
detector for solving Consensus in asynchronous
systems with a majority of correct processes.

1  Introduction 

1.1  Background 

The asynchronous model of distributed computing
has been extensively studied. Informally, an
asynchronous distributed system is one in which
message transmission times and relative processor
speeds are both unbounded. Thus an algorithm
designed for an asynchronous system does not rely
on such bounds for its correctness. In practice,
asynchrony is introduced by unpredictable loads
on the system.

Although the asynchronous model of computation
is attractive for the reasons outlined above,
it is well-known that many fundamental problems
of fault-tolerant distributed computing that are
solvable in synchronous systems are unsolvable in
asynchronous systems. In particular, it is well-known
that Consensus and several forms of reliable
broadcast, including Atomic Broadcast, cannot
be solved deterministically in an asynchronous
system that is subject to even a single crash failure
[FLP85, DDS87]. Essentially, these impossibility
results stem from the inherent difficulty of
determining whether a process has actually crashed or
is only “very slow”.

To circumvent these impossibility results,
previous research focused on the use of randomization
techniques [CD89], the definition of some
weaker problems and their solutions [DLP+86,
ABND+87, BW87], or the study of several models
of partial synchrony [DDS87, DLS88]. However,
the impossibility of deterministic solutions
to many agreement problems (such as Consensus
and Atomic Broadcast) remains a major obstacle
to the use of the asynchronous model of computation
for fault-tolerant distributed computing.

An alternative approach to circumvent such
impossibility results is to augment the asynchronous
model of computation with a failure detector.
Informally, a failure detector is a distributed oracle
that gives (possibly incorrect) hints about which
processes may have crashed so far: each process
has access to a local failure detector module that
monitors other processes in the system, and
maintains a list of those that it currently suspects to
have crashed. Each process periodically consults
its failure detector module, and uses the list of
suspects returned in solving Consensus.

A failure detector module can make mistakes by
erroneously adding processes to its list of suspects:
i.e., it can suspect that a process p has crashed
even though p is still running. If it later believes
that suspecting p was a mistake, it can remove p
from its list. Thus, each module may repeatedly
add and remove processes from its list of suspects.
Furthermore, at any given time the failure
detector modules at two different processes may have
different lists of suspects.
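
To make the behaviour of such a module concrete, the following sketch shows one way a suspect list could be maintained. It is purely illustrative and rests on assumptions the text does not make: the text treats failure detectors abstractly, whereas this sketch uses heartbeats and a fixed timeout, and the names FailureDetectorModule, heartbeat and query are hypothetical. In a genuinely asynchronous system no fixed timeout guarantees any accuracy property; the point here is only that suspects can be added by mistake and removed again.

```python
class FailureDetectorModule:
    """Local suspect list driven by heartbeats and a fixed timeout."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        self.last_heard = {p: 0.0 for p in peers}
        self.suspects = set()

    def heartbeat(self, peer, now):
        """Record a heartbeat; a suspected peer that is heard from is cleared."""
        self.last_heard[peer] = now
        self.suspects.discard(peer)

    def query(self, now):
        """Return the peers currently suspected to have crashed."""
        for peer, seen in self.last_heard.items():
            if now - seen > self.timeout:
                self.suspects.add(peer)
        return set(self.suspects)
```

Two modules built this way at different processes receive heartbeats at different times, so their suspect lists can differ at any given moment, as described above.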

It is important to note that the mistakes made
by a failure detector should not prevent any correct
process from behaving according to specification.
For example, consider an algorithm that
uses a failure detector to solve Atomic Broadcast
in an asynchronous system. Suppose all the failure
detector modules wrongly (and permanently)
suspect that a correct process p has crashed. The
Atomic Broadcast algorithm must still ensure that
p delivers the same set of messages, in the same
order, as all the other correct processes. Furthermore,
if p broadcasts a message m, all correct processes
must deliver m.¹

In [CT91], we showed that a surprisingly weak
failure detector is sufficient to solve Consensus and
Atomic Broadcast in asynchronous systems with a
majority of correct processes. This failure detector,
called the eventually weak failure detector and
denoted W here, satisfies only the following two
properties:²

1.  There  is  a  time  after  which  every  process  that 
crashes  is  always  suspected  by  some  correct 
process. 

2. There is a time after which some correct process
is never suspected by any correct process.

¹A different approach was taken in [RB91]: a correct
process that is wrongly suspected to have crashed
voluntarily leaves the system. It may later rejoin the
system by assuming a new identity.

²In [CT91], this was denoted $\Diamond W$.


Note that, at any given time t, processes cannot
use W to determine the identity of a correct
process. Furthermore, they cannot determine
whether there is a correct process that will not be
suspected after time t.

The failure detector W can make an infinite
number of mistakes. In fact, it can forever add and
then remove some correct processes from the lists
of suspects (this reflects the inherent difficulty of
determining whether a process is just slow or has
crashed). Moreover, some correct processes may
be erroneously suspected to have crashed by all
the other processes throughout the entire execution.

The two properties of W state that eventually
something must hold forever; this may appear too
strong a requirement to implement in practice.
However, when solving a problem that “terminates”,
such as Consensus, it is not really required
that the properties hold forever, but merely that
they hold for a sufficiently long time, i.e., long
enough for the algorithm that uses the failure detector
to achieve its goal. For instance, in practice
the algorithm of [CT91] that solves Consensus using
W only needs the two properties of W to hold
for a relatively short period of time.³ However, in
an asynchronous system it is not possible to quantify
“sufficiently long”, since even a single process
step or a single message transmission is allowed to
take an arbitrarily long amount of time. Thus it
is convenient to state the properties of W in the
stronger form given above.


³In that algorithm processes are cyclically elected as
“coordinators”. Consensus is achieved as soon as a
correct coordinator is reached, and no process suspects
it to have crashed while this coordinator is trying to
enforce consensus.

1.2 The problem

The failure detection properties of W are sufficient
to solve Consensus in asynchronous systems. But
are they necessary? For example, consider failure
detector A that satisfies Property 1 of W and the
following weakening of Property 2:

There is a time after which some correct
process is never suspected by at least
99% of the correct processes.

A is clearly weaker than W. Is it possible to solve
Consensus using A? Indeed, what is the weakest
failure detector sufficient to solve Consensus in
asynchronous systems? In trying to answer this
fundamental question we run into a problem. Consider
failure detector B that satisfies the following
two properties:

1. There is a time after which every process that
crashes is always suspected by all correct processes.

2. There is a time after which some correct process
is never suspected by a majority of the
processes.

It seems that B and W are incomparable: B’s
first property is stronger than W’s, and B’s second
property is weaker than W’s. Is it possible to
solve Consensus in an asynchronous system using
B? The answer turns out to be “yes” (provided
that this asynchronous system has a majority of
correct processes, as W also requires). Since W
and B appear to be incomparable, one may be
tempted to conclude that W cannot be the “weakest”
failure detector with which Consensus is solvable.
Even worse, it raises the possibility that no
such “weakest” failure detector exists.

However, a closer examination reveals that B
and W are indeed comparable in a natural way:
There is a distributed algorithm $T_{B \to W}$ that can
transform B into a failure detector with the
Properties 1 and 2 of W. $T_{B \to W}$ works for any
asynchronous system that has a majority of correct
processes. We say that W is reducible to B in such
a system. Since $T_{B \to W}$ is able to transform B into
W in an asynchronous system, B must provide at
least as much information about process failures
as W does. Intuitively, B is at least as strong as
W.

1.3  The  result 

In [CT91], we showed that W is sufficient to solve
Consensus in asynchronous systems if and only if
n > 2f (where n is the total number of processes,
and f is the maximum number of processes that
may crash). In this paper, we prove that W is
reducible to any failure detector D that can be used
to solve Consensus (this result holds for any
asynchronous system). We show this reduction by
giving a distributed algorithm $T_{D \to W}$ that transforms
any such D into W. Therefore, W is indeed the
weakest failure detector that can be used to solve
Consensus in asynchronous systems with n > 2f.
Furthermore, if n ≤ 2f, any failure detector that
can be used to solve Consensus must be strictly
stronger than W.

The task of transforming any given failure detector
D (that can be used to solve Consensus)
into W runs into a serious technical difficulty for
the following reasons:

• To strengthen our result, we do not restrict
the output of D to lists of suspects. Instead,
this output can be any value that encodes
some information about failures. For example,
a failure detector D should be allowed to
output any boolean formula, such as “(not p)
and (q or r)” (i.e., p is up and either q or r has
crashed), or any encoding of such a formula.
Indeed, the output of D could be an arbitrarily
complex (and unknown) encoding of failure
information. Our transformation from D
into W must be able to decode this information.

•  Even  if  the  failure  information  provided  by  D 
is  not  encoded,  it  is  not  clear  how  to  extract 
from  it  the  failure  detection  properties  of  W. 
Consequently,  if  D  is  given  in  isolation,  the 
task  of  transforming  it  into  W  may  not  be 
possible. 

Fortunately, since D can be used to solve
Consensus, there is a corresponding algorithm,
$Consensus_D$, that is somehow able to “decode”
the information about failures provided by D, and
knows how to use it to solve Consensus. Our
reduction algorithm, $T_{D \to W}$, uses $Consensus_D$ to
extract this information from D and transforms it
into the properties of W.

2  The  model 

We describe a model of asynchronous computation
with failure detection patterned after the one in
[FLP85].

2.1 Failure Detectors

We assume the existence of a discrete global clock
to simplify the presentation. This is merely a
fictional device: the processes do not have access to
it. We take the range T of the clock’s ticks to be
the set of natural numbers.

The system consists of a set of n processes,
$\Pi = \{p_1, p_2, \ldots, p_n\}$, that may fail by crashing.
A failure pattern F is a function from T to $2^{\Pi}$,
where F(t) denotes the set of processes that have
crashed through time t. Once a process crashes, it
does not “recover”, i.e., $\forall t : F(t) \subseteq F(t+1)$. We
define $crashed(F) = \bigcup_{t \in T} F(t)$ and
$correct(F) = \Pi - crashed(F)$. If $p \in crashed(F)$ we
say p crashes in F and if $p \in correct(F)$ we say p is
correct in F.

Associated with each failure detector is a range
$\mathcal{R}$ of values output by that failure detector. A
failure detector history H with range $\mathcal{R}$ is a
function from $\Pi \times T$ to $\mathcal{R}$. H(p, t) is the value of
the failure detector module of process p at time
t. A failure detector D is a function that maps
each failure pattern F to a set of failure detector
histories with range $\mathcal{R}_D$ (where $\mathcal{R}_D$ denotes the
range of failure detector outputs of D). D(F)
denotes the set of possible failure detector histories
permitted by D for the failure pattern F.

For example, consider the failure detector W
mentioned in the introduction. Each failure detector
module of W outputs a set of processes that are
suspected to have crashed: in this case $\mathcal{R}_W = 2^{\Pi}$.
For each failure pattern F, W(F) is the set of all
failure detector histories $H_W$ with range $\mathcal{R}_W$ that
satisfy the following properties:

1. There is a time after which every process that
crashes in F is always suspected by some process
that is correct in F:

$\exists t \in T, \forall p \in crashed(F), \exists q \in correct(F), \forall t' \geq t : p \in H_W(q, t')$

2. There is a time after which some process that
is correct in F is never suspected by any process
that is correct in F:

$\exists t \in T, \exists p \in correct(F), \forall q \in correct(F), \forall t' \geq t : p \notin H_W(q, t')$


Note that we specify a failure detector D as a
function of the failure pattern F of an execution.
However, this does not preclude an implementation
of D from using other aspects of the execution
such as when messages are received. Thus,
executions with the same failure pattern F may
still have different failure detector histories. It is
for this reason that we allow D(F) to be a set
of failure detector histories from which the actual
failure detector history for a particular execution
is selected non-deterministically.
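
The definitions of this subsection can be made concrete with a small sketch. The properties of W are “eventual” and so cannot be decided on a finite prefix of a history; the sketch below only checks whether both properties hold throughout a chosen finite window of clock ticks. The encoding of a failure pattern as a map from crashed processes to crash times, and the names `crashed_through`, `w_holds_on_window` and `example_history`, are assumptions made for illustration and do not appear in the paper.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical encoding: crash_time[p] is the tick at which p crashes;
# processes absent from crash_time are correct in F.
History = Callable[[str, int], Set[str]]  # H(q, t): suspects of q at time t


def crashed_through(crash_time: Dict[str, int], t: int) -> Set[str]:
    """F(t): the set of processes that have crashed through time t."""
    return {p for p, when in crash_time.items() if when <= t}


def w_holds_on_window(processes: Iterable[str], crash_time: Dict[str, int],
                      history: History, start: int, horizon: int) -> bool:
    """Check both properties of W for every t' with start <= t' < horizon."""
    crashed = set(crash_time)
    correct = set(processes) - crashed
    window = range(start, horizon)
    # Property 1: each crashed p is suspected throughout the window
    # by at least one (fixed) correct q.
    prop1 = all(
        any(all(p in history(q, t) for t in window) for q in correct)
        for p in crashed
    )
    # Property 2: some correct p is suspected by no correct q in the window.
    prop2 = any(
        all(p not in history(q, t) for q in correct for t in window)
        for p in correct
    )
    return prop1 and prop2


def example_history(q: str, t: int) -> Set[str]:
    """Toy history: a suspects c from tick 3 on, and wrongly suspects b before tick 5."""
    if q != "a":
        return set()
    suspects = set()
    if t >= 3:
        suspects.add("c")
    if t < 5:
        suspects.add("b")
    return suspects
```

In the toy history, process c crashes at tick 2. The window starting at tick 5 satisfies both properties, whereas a window starting at tick 0 does not, because c is not yet suspected by any correct process.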

2.2  Algorithms 

We model the asynchronous communication channels
as a message buffer which contains messages
of the form (p, data, q) indicating that process p
has sent data addressed to process q and q has
not yet received that message. An algorithm A
is a collection of n (possibly infinite state)
deterministic automata, one for each of the processes.
A(p) denotes the automaton running on process
p. Computation proceeds in steps of the given
algorithm A. In each step of A, process p performs
atomically the following three phases:

Receive phase: p receives a single message of
the form (q, data, p) from the message buffer,
or a “null” message, denoted $\lambda$, meaning that
no message is received by p during this step.

Failure detector query phase: p queries and
receives a value from its failure detector module.
We say that p sees a value d when the
value returned by p’s failure detector module
is d.

Send phase: p changes its state and sends a message
to all the processes according to the automaton
A(p), based on its state at the beginning
of the step, the message received in
the receive phase, and the value that p sees in
the failure detector query phase.⁴

⁴In the send phase, p sends a message to all the processes
atomically. As was shown in [FLP85], the ability to do
so is not sufficient for solving Consensus. An alternative
formulation of a step could restrict a process to sending a
message to a single process in the send phase. We can show
that both formulations are equivalent for our purposes.

The message actually received by the process p in
the receive phase is chosen non-deterministically
from amongst the messages in the message buffer
destined to p, and the null message $\lambda$. The null
message may be received even if there are messages
in the message buffer that are destined to
p: the fact that m is in the message buffer merely
indicates that m was sent to p. Since ours will
be a model of asynchronous systems, where messages
may experience arbitrary (but finite) delays,
the amount of time m may remain in the message
buffer before it is received is unbounded. Though
message delays are arbitrary, we also want them to
be finite. We model this by introducing a liveness
assumption: every message sent will eventually be
received, provided its recipient makes “sufficiently
many” attempts to receive messages. All this will
be made more precise later.

To keep things simple we assume that a process
p sends a message m to q at most once. This
allows us to speak of the contents of the message
buffer as a set, rather than a multiset. We can
easily enforce this by adding a counter to each
message sent by p to q, so this assumption does
not damage generality.

2.3 Configurations, Runs and Environments

A configuration is a pair (s, M), where s is a function
mapping each process p to its local state, and
M is a set of triples of the form (q, data, p)
representing the messages presently in the message
buffer. An initial configuration of an algorithm A
is a configuration (s, M), where s(p) is an initial
state of A(p) and $M = \emptyset$. A step of a given
algorithm A transforms one configuration to another.
A step of A is uniquely determined by the identity
of the process p that takes the step, the message
m received by p during that step, and the failure
detector value d seen by p during the step. Thus,
we identify a step of A with a tuple (p, m, d, A)
($m = \lambda$ when the null message is received). We say
that a step e = (p, m, d, A) is applicable to a
configuration C = (s, M) if and only if $m \in M \cup \{\lambda\}$.
We write e(C) to denote the unique configuration
that results when e is applied to C.
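
As an illustration of how a step transforms a configuration, the following minimal sketch represents the message buffer as a set of (sender, data, receiver) triples and the null message as `None`. The names `Configuration`, `applicable`, `apply_step` and the toy `counter_automaton` are hypothetical; the sketch also checks that a received message is addressed to the process taking the step, which the receive phase implies.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List, Optional, Tuple

Message = Tuple[str, Any, str]            # (sender, data, receiver)
Step = Tuple[str, Optional[Message], Any]  # (p, m, d); m is None for the null message
# An automaton maps (p, state, m, d) to (new state, list of (receiver, data)).
Automaton = Callable[[str, Any, Optional[Message], Any],
                     Tuple[Any, List[Tuple[str, Any]]]]


@dataclass
class Configuration:
    state: Dict[str, Any]        # s: local state of each process
    buffer: FrozenSet[Message]   # M: messages sent but not yet received


def applicable(step: Step, config: Configuration) -> bool:
    """A step is applicable if m is the null message or a buffered message for p."""
    p, m, _d = step
    return m is None or (m in config.buffer and m[2] == p)


def apply_step(step: Step, config: Configuration,
               automaton: Automaton) -> Configuration:
    """Return e(C): receive m, use the seen value d, change state, send messages."""
    if not applicable(step, config):
        raise ValueError("step is not applicable to this configuration")
    p, m, d = step
    new_state, outgoing = automaton(p, config.state[p], m, d)
    buffer = set(config.buffer)
    if m is not None:
        buffer.discard(m)                                  # m has now been received
    buffer.update((p, data, q) for q, data in outgoing)    # newly sent messages
    state = dict(config.state)
    state[p] = new_state
    return Configuration(state, frozenset(buffer))


PROCESSES = ("p1", "p2")


def counter_automaton(p, state, m, d):
    """Toy automaton: count own steps and send the count to all processes."""
    new_state = state + 1
    return new_state, [(q, ("count", new_state)) for q in PROCESSES]
```

Because the result depends only on the step and the configuration, repeated application of `apply_step` along a schedule yields the unique configuration S(C).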

A schedule S of algorithm A is a (possibly finite)
sequence of steps of A. $S_\perp$ denotes the empty
schedule. We say that a schedule S of an algorithm
A is applicable to a configuration C if and only if
(a) $S = S_\perp$, or (b) S[1] is applicable to C, S[2]
is applicable to S[1](C), etc.⁵ If S is a finite
schedule applicable to C, S(C) denotes the unique
configuration that results from applying S to C.
Note $S_\perp(C) = C$ for all configurations C.

A partial run of algorithm A using a failure detector
D is a tuple $R = \langle F, H_D, I, S, T \rangle$ where F
is a failure pattern, $H_D \in D(F)$ is a failure detector
history, I is an initial configuration of A, S
is a finite schedule of A, and T is a finite list of
increasing time values (indicating when each step
in S occurred) such that |S| = |T|, S is applicable
to I, and for all $i \leq |S|$, if S[i] is of the form
(p, m, d, A) then:

• p has not crashed by time T[i], i.e.,
$p \notin F(T[i])$

• d is the value of the failure detector module
of p at time T[i], i.e., $d = H_D(p, T[i])$

Informally, a partial run of A using D represents
a finite point of some execution of A using D.

A run of an algorithm A using a failure detector
D is a tuple $R = \langle F, H_D, I, S, T \rangle$ where F is a
failure pattern, $H_D \in D(F)$ is a failure detector
history, I is an initial configuration of A, S is an
infinite schedule of A, and T is an infinite list of
increasing time values indicating when each step
in S occurred. In addition to satisfying the above
properties of a partial run, a run must also satisfy
the following properties:

• Every correct process takes an infinite number
of steps in S.

• Every message sent to a correct process is
eventually received.

In [CT91], we proved that any algorithm that
uses W to solve Consensus requires n > 2f. With
other failure detectors the requirements may be
different. For example, there is a failure detector
that can be used to solve Consensus only if $p_1$ and
$p_2$ do not both crash. In general, whether a given
failure detector can be used to solve Consensus
depends upon assumptions about the underlying
“environment”. Formally, an environment $\mathcal{E}$ (of
an asynchronous system) is a set of possible failure
patterns.

⁵We denote by v[i] the ith element of a sequence v.

3  The  Consensus  problem 

In  the  Consensus  problem,  each  process  p  has  an 
initial  value,  0  or  1,  and  must  reach  an  irrevocable 
decision  on  one  of  these  values. 

We say that algorithm A uses failure detector D
to solve Consensus in environment $\mathcal{E}$ if every run
$R = \langle F, H_D, I, S, T \rangle$ of A using D where $F \in \mathcal{E}$
satisfies:

Termination: Each correct process eventually
decides.

Validity: Each correct process decides on the
initial value of some process.

Agreement: No two correct processes decide
differently.
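
The three requirements can be restated as a check on the outcome of a run. The sketch below is only illustrative and assumes a finished outcome is summarized by the initial values, the set of correct processes, and the decisions reached so far; the function name `check_consensus` and this summary format are not part of the paper. On a finite prefix of a run, a failed Termination check means only that some correct process has not decided yet.

```python
from typing import Dict, Set


def check_consensus(initial: Dict[str, int], decisions: Dict[str, int],
                    correct: Set[str]) -> Dict[str, bool]:
    """Evaluate Termination, Validity and Agreement on a summarized outcome."""
    decided = {p: v for p, v in decisions.items() if p in correct}
    return {
        # Termination: every correct process has decided.
        "termination": all(p in decided for p in correct),
        # Validity: each decision is the initial value of some process.
        "validity": all(v in set(initial.values()) for v in decided.values()),
        # Agreement: no two correct processes decide differently.
        "agreement": len(set(decided.values())) <= 1,
    }
```

For example, with initial values {p1: 0, p2: 1, p3: 1}, correct processes p1 and p2, and both deciding 1, all three checks succeed.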

4  Reducibility 

We now define what it means for an algorithm
$T_{D \to D'}$ to transform a failure detector D into
another failure detector D' in an environment $\mathcal{E}$.
Algorithm $T_{D \to D'}$ uses D to maintain a variable
$output_p$ at every process p. This variable, reflected
in the local state of p, emulates the output of D'
at p. Let $O_R$ be the history of all the output
variables in run R, i.e., $O_R(p, t)$ is the value of
$output_p$ at time t in run R. Algorithm $T_{D \to D'}$
transforms D into D' in $\mathcal{E}$ if and only if for every
run $R = \langle F, H_D, I, S, T \rangle$ of $T_{D \to D'}$ using D,
where $F \in \mathcal{E}$, $O_R \in D'(F)$.

Given $T_{D \to D'}$, anything that can be done using
D' in $\mathcal{E}$ can be done using D instead. To see
this, suppose a given algorithm B requires failure
detector D' (when it executes in $\mathcal{E}$), but only D
is available. We can still execute B as follows.
Concurrently with B, we run $T_{D \to D'}$ to transform
D into D'. We now modify the failure detector
query phase of each step of B at process p: p reads
the current value of $output_p$ (which is concurrently
maintained by $T_{D \to D'}$) instead of querying its
failure detector module. This is illustrated in Fig. 1.

Figure 1: Transforming D into D'

Intuitively, since $T_{D \to D'}$ is able to use D to
emulate D', D provides at least as much information
about process failures in $\mathcal{E}$ as D' does. Thus, if
there is an algorithm $T_{D \to D'}$ that transforms D
into D' in $\mathcal{E}$, we write $D \succeq_{\mathcal{E}} D'$ and say that D' is
reducible to D in $\mathcal{E}$; we also say that D' is weaker
than D in $\mathcal{E}$.

5  An  outline  of  the  result 

In [CT91] we showed that W can be used to solve
Consensus in any environment in which n > 2f.
We now show that W is weaker than any failure
detector that can be used to solve Consensus. This
result holds for any environment $\mathcal{E}$. Together with
[CT91], this implies that W is indeed the weakest
failure detector that can be used to solve Consensus
in any environment in which n > 2f.

To prove our result, we first define a new failure
detector, denoted $\Omega$, that is at least as strong as
W. We then show that any failure detector D that
can be used to solve Consensus is at least as strong
as $\Omega$. Thus, D is at least as strong as W.

The output of the failure detector module of $\Omega$
at a process p is a single process, q, that p currently
considers to be correct; we say that p trusts q. In
this case, $\mathcal{R}_\Omega = \Pi$. For each failure pattern F,
$\Omega(F)$ is the set of all failure detector histories $H_\Omega$
with range $\mathcal{R}_\Omega$ that satisfy the following property:

• There is a time after which all the correct processes
always trust the same correct process:

$\exists t \in T, \exists q \in correct(F), \forall p \in correct(F), \forall t' \geq t : H_\Omega(p, t') = q$

As with W, the output of the failure detector module
of $\Omega$ at a process p may change with time, i.e.,
p may trust different processes at different times.
Furthermore, at any given time t, processes p and
q may trust different processes.

Theorem 1: For all environments $\mathcal{E}$, $\Omega \succeq_{\mathcal{E}} W$.
Proof: [Sketch] The reduction algorithm $T_{\Omega \to W}$
that transforms $\Omega$ into W is as follows. Each process
p periodically sets $output_p \leftarrow \Pi - \{q\}$, where
q is the process that p currently trusts according
to $\Omega$. It is easy to see that (in any environment $\mathcal{E}$)
this output satisfies the two properties of W. □
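
The local computation in this proof sketch is a single set difference. A minimal rendering, assuming processes are identified by strings and with the hypothetical function name `omega_to_w_output`:

```python
from typing import FrozenSet, Iterable


def omega_to_w_output(processes: Iterable[str], trusted: str) -> FrozenSet[str]:
    """Emulated W output at p: suspect every process except the one p trusts."""
    return frozenset(processes) - {trusted}
```

Once every correct process permanently trusts the same correct process q, each of them suspects all processes other than q, so every crashed process is suspected and q is never suspected, which matches the two properties of W.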

Theorem 2: For all environments $\mathcal{E}$, if a failure
detector D can be used to solve Consensus in $\mathcal{E}$,
then $D \succeq_{\mathcal{E}} \Omega$.

Proof: The reduction algorithm $T_{D \to \Omega}$ is shown
in Section 6. It is the core of our result. □

Corollary 3: For all environments $\mathcal{E}$, if a failure
detector D can be used to solve Consensus in $\mathcal{E}$,
then $D \succeq_{\mathcal{E}} W$.

In [CT91] we proved that, for all environments $\mathcal{E}$ in
which n > 2f, W can be used to solve Consensus.
Together with Corollary 3, this shows that:

Corollary 4: For all environments $\mathcal{E}$ in which
n > 2f, W is the weakest failure detector that
can be used to solve Consensus in $\mathcal{E}$.

6  The  reduction  algorithm 

Let $\mathcal{E}$ be an environment, D be a failure detector
that can be used to solve Consensus in $\mathcal{E}$, and
$Consensus_D$ be the Consensus algorithm that uses
D. We describe an algorithm $T_{D \to \Omega}$ that transforms
D into $\Omega$ in $\mathcal{E}$. Intuitively, this algorithm
works as follows. Fix an arbitrary run of $T_{D \to \Omega}$
using D in $\mathcal{E}$, with failure pattern $F \in \mathcal{E}$, and
failure detector history $H_D \in D(F)$. We shall
first construct an infinite directed acyclic graph,
denoted G, whose vertices are some of the failure
detector values that occur in $H_D$, and whose edges
are consistent with the time at which these values
occur. We then show that G induces a simulation
forest T that encodes an infinite set of possible
runs of $Consensus_D$. Finally, we show how to
extract from T the identity of a process p* that is
correct in F.

The induced simulation forest is infinite and
thus it cannot be computed by any process. However,
the information needed to extract p* is
present in a finite subgraph of the forest. It will
be sufficient for each correct process p to construct
ever increasing finite approximations of the simulation
forest T that will eventually include this
crucial finite subgraph. At all times, p uses its
present approximation of T to select the identity
of some process: once p’s approximation of T includes
the crucial finite subgraph, the selected process
will be p* (forever). Thus, there is a time after
which all correct processes trust the same correct
process, p*, which is exactly what $\Omega$ requires.

We say that a process is correct (crashes) if it
is correct (crashes) in F. For simplicity, we assume
that a process p sees a value d at most once
(this can be enforced by tagging a counter to each
value seen). For the rest of this paper, whenever
we refer to a run of $Consensus_D$, we mean a run of
$Consensus_D$ using D. Furthermore, we only consider
schedules of $Consensus_D$, and therefore we
write (p, m, d) instead of (p, m, d, $Consensus_D$) to
denote a step.

6.1  A  DAG  and  a  forest 

Given the failure pattern F and the corresponding
failure detector history $H_D \in D(F)$ that were
fixed above, let G be any infinite directed acyclic
graph with the following properties:

1. The vertices of G are of the form [p, d] where
$d = H_D(p, t)$ for some time t.

2. If $[q_1, d_1] \to [q_2, d_2]$ is an edge of G and
$d_1 = H_D(q_1, t_1)$ and $d_2 = H_D(q_2, t_2)$ then $t_1 < t_2$.

3. G is transitively closed.

4. Let p be any correct process and V be a finite
subset of vertices of G. There is a failure detector
value d such that for all vertices [p', d']
in V, $[p', d'] \to [p, d]$ is an edge of G.

Note that such a DAG represents only a “sampling”
of the failure detector values that occur in
$H_D$. In particular, we do not require that it contain
all the values that occur in $H_D$ or that it
relate (with an edge) all the values according to
the time at which they occur.

Let $g = [q_1, d_1], [q_2, d_2], \ldots$ be any (finite or
infinite) path of G. A schedule S is compatible
with g if it has the same length as g, and
$S = (q_1, m_1, d_1), (q_2, m_2, d_2), \ldots$ for some (possibly
null) messages $m_1, m_2, \ldots$; S is compatible
with G if it is compatible with some path of G. S
is induced by g and an initial configuration I (of
$Consensus_D$) if S is compatible with g and applicable
to I. S is induced by G and I if S is compatible
with G and applicable to I. Note that each g and
I induce several schedules, each corresponding to
a different sequence of messages received.
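
Compatibility constrains only which process takes each step and which failure detector value it sees; the received messages are left free. The sketch below illustrates this for a finite path, under the assumption that G is given as an explicit finite set of edges between vertices (p, d). The names `is_path` and `compatible_with_path` are hypothetical.

```python
from typing import Hashable, Optional, Sequence, Set, Tuple

Vertex = Tuple[str, Hashable]                 # [p, d]
Step = Tuple[str, Optional[tuple], Hashable]  # (p, m, d); m is None for the null message


def is_path(path: Sequence[Vertex], edges: Set[Tuple[Vertex, Vertex]]) -> bool:
    """True if every pair of consecutive vertices is joined by an edge."""
    return all((u, v) in edges for u, v in zip(path, path[1:]))


def compatible_with_path(schedule: Sequence[Step], path: Sequence[Vertex]) -> bool:
    """Same length, and the i-th step is taken by q_i and sees d_i."""
    return len(schedule) == len(path) and all(
        p == q and d == seen for (p, _m, d), (q, seen) in zip(schedule, path)
    )
```

Whether a compatible schedule is also induced by the path and an initial configuration I additionally depends on its applicability to I, which this sketch does not check.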

Lemma 5: Let S be any finite schedule
induced by G and some initial configuration
I of $Consensus_D$. There is a T such that
$\langle F, H_D, I, S, T \rangle$ is a partial run of $Consensus_D$.

Lemma 6: Let S be any infinite schedule induced
by G and some I, such that every correct process
takes an infinite number of steps and every
message sent to a correct process is eventually
received. There is a T such that $\langle F, H_D, I, S, T \rangle$ is
a run of $Consensus_D$.

The set of schedules that are induced by G and
some particular I can be organized as a tree, the
simulation tree $T_G^I$ induced by G and I. These
schedules are the vertices of the tree, with (the
empty schedule) $S_\perp$ at the root. There is an edge
from S to S' if and only if $S' = S \cdot e$ for a step e.

Lemma 7: Let S be any vertex of $T_G^I$ and p be
any correct process. Let m be a message in the
message buffer of S(I) addressed to p or the null
message. $T_G^I$ has a vertex $S \cdot (p, m, d)$ for some d.

Lemma 8: Let $S, S_1, S_2, \ldots, S_k$ be vertices of $T_G^I$.
There is a schedule E containing only steps of correct
processes such that:

1. $S \cdot E$ is a vertex of $T_G^I$ and all correct processes
have decided in $S \cdot E(I)$.

2. $S_i \cdot E$ ($1 \leq i \leq k$) is compatible with G.

Note that E may not be applicable to $S_i(I)$, and
thus $S_i \cdot E$ is not necessarily a vertex of $T_G^I$.

Let $I^i$, $0 \leq i \leq n$, denote the initial configuration
of $Consensus_D$ in which the initial values of
$p_1 \ldots p_i$ are 1, and the initial values of $p_{i+1} \ldots p_n$
are 0. We define the simulation forest induced by
G to be the set of n + 1 simulation trees induced
by G and these initial configurations.


6.2 Tagging the simulation forest

We assign a set of tags to each vertex of each tree
$T_G^{I^i}$ in the simulation forest induced by G. Vertex
S of $T_G^{I^i}$ receives tag k if and only if it has a
descendant S' such that some correct process has decided
k in $S'(I^i)$. Hereafter, $T^i$ denotes the tagged tree
$T_G^{I^i}$, and T denotes the tagged simulation forest.

Lemma 9: Each vertex of $T^i$ has at least one tag.

A vertex of $T^i$ is monovalent if it has only one
tag, and bivalent if it has both tags, 0 and 1. A
vertex is 0-valent if it is monovalent and is tagged
0; 1-valent is similarly defined.

Lemma 10: The ancestors of a bivalent vertex
are bivalent. The descendants of a k-valent vertex
are k-valent.

Lemma 11: If S is a bivalent vertex of $T^i$ then
no correct process has decided in $S(I^i)$.

Recall that in $I^0$ all processes have initial value 0,
while in $I^n$ they all have initial value 1.

Lemma 12: The root of $T^0$ is 0-valent and the
root of $T^n$ is 1-valent.

If the root of $T^i$ is bivalent, then i is bivalent critical.
If the root of $T^{i-1}$ is 0-valent but the root of
$T^i$ is 1-valent, then i is monovalent critical. Index
i is critical if it is monovalent or bivalent critical.

Lemma 13: There is a critical i, $0 < i \leq n$.
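
The tagging rule and the search for a critical index can be sketched on a finite tree. The actual trees are infinite, so the code below applies only to a finite approximation; it assumes the tree is given as a `children` map, that `decided` records the values decided by correct processes at each vertex, and that a vertex counts among its own descendants. These inputs and the names `tags`, `valence` and `critical_index` are hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Set, Tuple


def tags(v: Hashable, children: Dict[Hashable, List[Hashable]],
         decided: Dict[Hashable, Set[int]]) -> Set[int]:
    """Tags of v: values decided at v or at any vertex below v."""
    result = set(decided.get(v, ()))
    for child in children.get(v, ()):
        result |= tags(child, children, decided)
    return result


def valence(tag_set: Set[int]) -> str:
    """Classify a vertex from its tag set."""
    if tag_set == {0, 1}:
        return "bivalent"
    if tag_set == {0}:
        return "0-valent"
    if tag_set == {1}:
        return "1-valent"
    return "untagged"  # can occur only on a finite approximation


def critical_index(root_valences: List[str]) -> Optional[Tuple[int, str]]:
    """Smallest critical i, given the valence of the root of each tree 0..n."""
    for i, v in enumerate(root_valences):
        if v == "bivalent":
            return i, "bivalent critical"
        if i > 0 and root_valences[i - 1] == "0-valent" and v == "1-valent":
            return i, "monovalent critical"
    return None
```

Since the first root is 0-valent and the last is 1-valent, scanning the roots from left to right must meet either a bivalent root or a 0-valent root immediately followed by a 1-valent one, which is the intuition behind Lemma 13.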

The critical index i is the key to extracting the
identity of a correct process. In fact, if i is monovalent
critical, we shall prove that $p_i$ must be correct
(Lemma 15). If i is bivalent critical, the correct
process will be found by focusing on the tree $T^i$,
as explained in the following section.

6.3  Of  hooks  and  forks 

We describe two types of finite subtrees of $T^i$
referred to as decision gadgets of $T^i$. Each type of
decision gadget is rooted at $S_\perp$ and has exactly
two leaves: one 0-valent and one 1-valent. The
least common ancestor of these leaves is called the
pivot. The pivot is clearly bivalent.

The first type of decision gadget is called a fork,
and is shown in Figure 2. The two leaves are children
of the pivot, obtained by applying different
steps of the same process p. Process p is the deciding
process of the fork, because its step after the
pivot determines the decision of correct processes.

The  second  type  of  decision  gadget  is  called  a 
hook,  and  is  shown  in  Figure  3.  Let  5  be  the  pivot 
of  the  hook.  There  is  a  step  e  such  that  5  •  e  is 
one  leaf,  and  the  other  leaf  is  S  -  (p,m,d)  •  e  for 
some  p,  m,  d.  Process  p  is  the  deciding  process  of 
the  hook,  because  the  decision  of  correct  process 
is  determined  by  whether  p  takes  the  step  (p,  m,  c. 
before  e. 

We  shall  prove  that  the  deciding  process  p  of  a 
gadget  must  be  correct  (Lemma  16).  Intuitively, 
this  is  because  if  p  crashes  no  process  can  figure 
out  whether  p  has  taken  the  step  that  determines 
the  decision  value.  The  existence  of  such  a  criti¬ 
cal  “hidden”  step  is  also  at  the  core  of  many  im¬ 
possibility  proofs  starting  with  [FLP85].  In  our 
case,  the  “hiding”  is  more  difficult  because  now 
processes  have  recourse  to  the  failure  detector  D. 
Despite  this,  the  hiding  of  the  step  of  the  deciding 
process  of  a  gadget  is  still  possible.  The  key  to 
proving  this  is  Lemma  8. 


[Figure 2: A fork; p is the deciding process. The tree is rooted at S_⊥; the pivot S has two children, S · (p, m, d), tagged {0}, and S · (p, m', d'), tagged {1}.]


Lemma 14: If i is bivalent critical then T^i has at least one decision gadget (and hence a deciding process).

6.4  Extracting  the  correct  process

By Lemma 13, there is a critical index i. If i is monovalent critical, Lemma 15 below shows how to extract a correct process. If i is bivalent critical, a correct process can be found by applying Lemmata 14 and 16.

Lemma 15: If i is monovalent critical then p_i is correct.

Lemma 16: The deciding process of a decision gadget is correct.

There may be several critical indices and several decision gadgets in the simulation forest. Thus, the above Lemmata may identify many correct processes. Our selection rule will choose one of these, as the failure detector Ω requires, as follows. It first determines the smallest critical index i. If i is monovalent critical, it selects p_i. If, on the other hand, i is bivalent critical, it chooses the “smallest” gadget in T^i according to some encoding of gadgets, and selects the corresponding deciding process. It is easy to encode finite graphs as natural numbers. Since a gadget is just a finite graph, the selection rule can use any such encoding. The selection rule is shown in Figure 4.

{Build and tag simulation forest T induced by G}
for i ← 0, 1, ..., n:
    T^i ← simulation tree induced by G and I^i
    for every vertex S of T^i
        if S has a descendant S' such that
            a correct process has decided k in S'(I^i)
        then add tag k to S

{Select a process from tagged simulation forest T}
i ← smallest critical index
if i is monovalent critical then return p_i
else return deciding process of the smallest gadget in T^i

Figure 4: Selecting a correct process

Lemma  17:  Figure  4  selects  a  correct  process. 

6.5  The  reduction  algorithm

The selection of a correct process described above is not yet the distributed algorithm T_{D→Ω} that we are seeking; it involved an infinite simulation forest and it was “centralized”. To turn it into a distributed algorithm, we will modify it as follows. Each process will cooperate with other processes to construct ever increasing finite approximations of the simulation forest. Such approximations will eventually contain the gadget and the other tagging information necessary to extract the identity of the same correct process chosen by the selection method in Figure 4.

Note that the selection method in Figure 4 involves three stages: the construction of G, a graph representing samples of failure detector values and their temporal relationship; the construction and tagging of the simulation forest induced by G; and finally, the selection of a correct process using this forest.

Algorithm T_{D→Ω} consists of two components.

In the first component, each process repeatedly queries its failure detector module and sends the failure detector values it sees to the other processes. This component enables processes to construct ever increasing finite approximations of the same G. Since all inter-process communication occurs in this component, we call it the communication component of T_{D→Ω}.

In the second component, each process repeatedly (a) constructs and tags the simulation forest induced by its current approximation of G, and (b) selects the identity of a process using its current simulation forest. Since this component does not require any communication, we call it the computation component of T_{D→Ω}.


6.5.1  The  communication  component

In this component processes cooperate to construct ever increasing approximations of the same G. Let G_p denote p’s current approximation of G. Roughly, each process p repeatedly executes: (i) If p receives G_q for some q, it incorporates this information by replacing G_p with the union of G_p and G_q. (ii) Process p queries its own failure detector module. Let d be the value that it sees and [p', d'] be any vertex currently in G_p. Clearly, p saw d after p' saw d'. Thus p adds [p, d] to G_p, with edges from all other vertices of G_p to [p, d]. Process p then sends its updated G_p to all other processes. The communication component of T_{D→Ω} for p is shown in Figure 5.

{Build the directed acyclic graph G_p}
G_p ← empty graph
repeat forever
    Receive phase:
        p receives m
    Failure detector query phase:
        d_p ← query failure detector D
    Send phase:
        if m is of the form (q, G_q, p) then G_p ← G_p ∪ G_q
        add [p, d_p] to G_p and edges from all other vertices of G_p to [p, d_p]
        output_p ← computation component {Fig. 6}
        p sends (p, G_p, q) to all q ∈ Π

Figure 5: Process p’s communication component

Recall that we are considering a fixed run of T_{D→Ω}, with failure pattern F, and failure detector history H_D ∈ D(F). The communication component of T_{D→Ω} constructs graphs that satisfy the following properties. Let G_p(t) denote the value of G_p at time t.

Lemma 18: For any correct process p and t ∈ T:

1. The vertices of G_p(t) are of the form [p', d'] where d' = H_D(p', t') for some time t'.

2. If [q_1, d_1] → [q_2, d_2] is an edge of G_p(t) and d_1 = H_D(q_1, t_1) and d_2 = H_D(q_2, t_2) then t_1 < t_2.

3. G_p(t) is transitively closed.

4. There is a time t' > t and a failure detector value d such that for all vertices [p', d'] of G_p(t), [p', d'] → [p, d] is an edge of G_p(t').

5. G_p(t) is a subgraph of G_p(t + 1).

6. For all correct q, there is a time t' > t such that G_p(t) is a subgraph of G_q(t').
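
The two graph updates performed in each iteration of the communication component (merging a received graph, then adding a fresh sample with edges from every existing vertex) can be sketched in Python. This is an illustrative sketch, not the paper's implementation: the class name `SampleGraph` and its methods are hypothetical, and each sample value is assumed to be hashable and distinct per query (for instance by pairing the detector value with a local sequence number, an assumption of this sketch).

```python
class SampleGraph:
    """Hypothetical container for a process's approximation G_p."""

    def __init__(self):
        self.vertices = set()   # vertices are pairs (process, sample)
        self.edges = set()      # edges are pairs (earlier vertex, later vertex)

    def merge(self, other):
        """G_p <- G_p U G_q: incorporate a graph received from another process."""
        self.vertices |= other.vertices
        self.edges |= other.edges

    def add_sample(self, p, d):
        """Add vertex [p, d] with an edge from every other vertex of the graph."""
        v = (p, d)
        self.edges |= {(u, v) for u in self.vertices if u != v}
        self.vertices.add(v)
        return v
```

Because every new vertex receives an edge from all vertices already present, a graph built only from these two operations stays acyclic, which matches the intent of the construction described above.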

Property 5 of the above lemma allows us to define G_p^∞ = ∪_{t∈T} G_p(t). From Property 6, we get:

Lemma 19: For any correct processes p and q, G_p^∞ = G_q^∞.

Lemma 19 allows us to define the limit graph G to be G_p^∞ for any correct process p. The first four properties of Lemma 18 imply:

Lemma  20:  The  limit  graph  G  satisfies  the  four 
properties  of  the  DAG  defined  in  Section  6.1. 


6.5.2  The  computation  component 

Since the limit graph G has the four properties of the DAG, we can apply the “centralized” selection method of Figure 4 to identify a correct process. This method involved:

•  Constructing and tagging the infinite simulation forest T induced by G.

•  Applying a rule to T to select a particular correct process p*.

In the computation component of T_{D→Ω}, each p approximates the above method by repeatedly:

•  Constructing and tagging the finite simulation forest T_p induced by G_p, its present finite approximation of G.

•  Applying the same rule to T_p to select a particular process.

Since the limit of T_p over time is T, and the information necessary to select p* is in a finite subgraph of T, we can show that eventually p will keep selecting the correct process p*, forever.

Actually, p cannot quite use the tagging method of Figure 4: that method requires knowing which processes are correct! Instead, p assigns tag k to a vertex S in T_p^i if and only if S has a descendant S' such that p itself has decided k in S'(I^i). If p is correct, this is eventually equivalent to the tagging method of Figure 4. If p is faulty, we do not care. Also, p cannot use exactly the same selection method as that of Figure 4: its current simulation forest T_p may not yet have a critical index or contain any decision gadget (although it eventually will!). In that case, p temporizes by just selecting itself. The computation component of T_{D→Ω} is shown in Figure 6. Let T_p(t) denote T_p at time t.

Lemma 21: For any correct p and any t ∈ T:

1. T_p(t) is a subgraph* of T

2. T_p(t) is a subgraph of T_p(t + 1)

3. lim_{t→∞} T_p(t) = T

*The subgraph relation ignores the tags.

{Build and tag simulation forest T_p induced by G_p}
for i ← 0, 1, ..., n:
    T_p^i ← simulation tree induced by G_p and I^i
    for every vertex S of T_p^i
        if S has a descendant S' such that
            p has decided k in S'(I^i)
        then add tag k to S

{Select a process from tagged simulation forest T_p}
if there is no critical index then return p
else
    i ← smallest critical index
    if i is monovalent critical then return p_i
    else if T_p^i has no gadgets then return p
    else return deciding process of the smallest gadget in T_p^i

Figure 6: Process p’s computation component

Lemma 22: For any correct p and any vertex S of T_p:

1. p never removes a tag from S.

2. There is a time after which the tags of S in T_p will always be the same as the tags of S in T.

Theorem 23: For any correct process p, there is a time after which output_p = p*, forever.

Theorem 2: For all environments E, if a failure detector D can be used to solve Consensus in E, then D ⪰ Ω.

Abstract

Most mutual exclusion algorithms require O(n) operations to enter the critical section regardless of how many processes are actively trying to enter it. This paper presents a mutual exclusion algorithm that is much more sensitive to how many processes currently want to enter the critical section. This algorithm is based on a parameter l, and assumes there is a total of n = k^l processes. If only one process wants to enter the critical section, l + 7 operations are sufficient. If there are t processes currently wanting to enter the critical section, then O(tlk) operations are all that is necessary. This is usually much less than the O(n) operations required by ordinary mutual exclusion algorithms.

If the need is only to elect one process, the same minimum of 8 operations will hold, but the number of variables used will be O(log t), if t processes are contending to be elected. This algorithm will also be symmetric, with no distinctions between processes.

1  Introduction

The question of mutual exclusion is a very old problem, dating from Dijkstra in 1965 [2]. However, most of the published solutions require an incoming process to look at every other potential competitor as part of the mutual exclusion process. If the number of such competitors is fairly small this is not a major problem, but some large systems (such as an airline database) could have thousands of processes that might want to examine or change the database. In this case, the time to check for competitors becomes a significant part of the time required for the mutual exclusion problem.

Leslie Lamport [3] came up with a solution to this problem, called Fast Mutual Exclusion. His algorithm allows a process to enter the critical section in a constant number of operations, regardless of the number of potential competitors, provided this process is the only one currently attempting to enter the critical section.

Lamport’s solution has a problem when there are two (or more) processes seeking the critical section. When two processes start to compete, it becomes necessary for every process to be checked to identify the competitor. This means the time to enter is either O(1) if you are alone, or O(N) if two processes are trying to enter. The aim of this paper is to examine solutions to the mutual exclusion problem, and to determine the extent to which the gap between O(1) and O(N) can be narrowed. Ideally we would want the time to be O(t), where t is the number of processes competing for the resource. This paper does not achieve this bound, but it does narrow the gap between one process and several processes.

2  Model 

The model used in this paper is a variation of the one given in Burns [1]. Let a system be a tuple S = (P, V, Q, φ, q_0), where P is a finite set of processes, V is a finite set of variables, Q is a set of system states, φ is a transition function, and q_0 is the specified initial state, with all processes in their remainder section and all variables holding appropriate values. Let N = |P| be the number of processes and M = |V| be the number of global variables in S. We will assign each process a unique identifier, and assign variables to processes or groups of processes.

2.1  Process  State 

The state of each process will be represented by a step that indicates where that process is within the algorithm. Let X_i be the set of steps in the algorithm that process P_i will take. X_i can be partitioned into R_i, T_i, C_i, and E_i, standing for the remainder region, trying (entry) region, critical section and exit region respectively of process P_i. Also let X = ∪_i X_i be the steps of all processes, and define R, T, C and E in a similar manner.

Y_k will be used to represent the set of possible values for variable V_k. By making the assumption that a process can only read or write one variable per step, we can partition the set of steps T ∪ E into disjoint sets Read_k and Write_k. These sets represent reads and writes respectively. Any other step would represent an internal operation, and is not considered separately. There is one pair of sets Read_k and Write_k for each variable V_k, representing the steps that can read and write that variable. All variables can be read or written by any process, although the algorithm may limit who will use a given variable.

Let Read = ∪_k Read_k and Write = ∪_k Write_k. A system state (often just called state) of S is an (M + N)-tuple q = (x_0, x_1, ..., x_{N−1}, v_1, v_2, ..., v_M), with x_i ∈ X_i and v_k ∈ Y_k. For each process P_i, x_i is the step it is currently about to execute and v_k is the value of variable V_k. The notation X_i(q) = x_i will indicate the step of a process P_i in state q, and V_k(q) = v_k is the value of the variable V_k in state q. Let Q be the set of all such system states of S.

2.2  Moves  and  Schedules 

Define a move function φ: Q × |P| → Q. This is a total function, with φ(q, i) being the state resulting from initially being in state q, and then letting P_i take one step. A schedule is a sequence h = i_1 i_2 ... (finite or infinite) of process indices. Define φ(q, h) in the usual manner by φ(q, h) = φ(φ(q, i_1), i_2 i_3 ...). An infinite schedule h is admissible from q if no process can stop outside its remainder region. A state q' is reachable if q' = φ(q, h) for some admissible schedule h. Admissibility means that for every finite prefix h_1 of h and every i with X_i(φ(q, h_1)) ∉ R there exists a finite prefix h_1 h_2 of h such that i occurs in h_2.

The initial state q_0 must conform to the requirement that X_i(q_0) ∈ R_i for all i.
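
The extension of the move function from single indices to finite schedules is a left fold, which the following Python sketch makes explicit. The helper `apply_schedule` and the toy move function `toy_phi` are hypothetical names introduced only for illustration; the toy state merely counts the steps taken by each process and ignores shared variables.

```python
from functools import reduce

def apply_schedule(phi, q, h):
    """Compute phi(q, h) for a finite schedule h by applying one move at a time."""
    return reduce(phi, h, q)

def toy_phi(q, i):
    """Toy move function (an assumption of this sketch): process i advances one step."""
    steps = list(q)
    steps[i] += 1
    return tuple(steps)
```

For example, `apply_schedule(toy_phi, (0, 0, 0), [0, 2, 2, 1])` lets process 0 move once, process 2 twice and process 1 once, in that order.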

2.3  Required  Conditions 

The following conditions enforce our intuitive ideas about deterministic asynchronous systems. For all q ∈ Q, all i, j ∈ [0 ... N − 1] and all q' ∈ Q with X_i(q) = X_i(q'), the following must hold:

1. For all j ≠ i: X_j(q) = X_j(φ(q, i)).

2. If X_i(q) ∈ R ∪ T, then X_i(φ(q, i)) ∈ T ∪ C.

3. If X_i(q) ∈ C ∪ E, then X_i(φ(q, i)) ∈ E ∪ R.

4. If X_i(q) ∈ Read_k and V_k(q) = V_k(q'), then X_i(φ(q, i)) = X_i(φ(q', i)).

5. If X_i(q) ∉ Read then X_i(φ(q, i)) = X_i(φ(q', i)).

6. If X_i(q) ∈ Write_k, then V_k(φ(q, i)) = V_k(φ(q', i)).

7. If X_i(q) ∉ Write_k then V_k(q) = V_k(φ(q, i)).

These conditions enforce the various intuitive expectations that we have for a deterministic asynchronous system. Condition 1 prevents one process’ move from affecting any other process. The next two conditions (2 and 3) only permit looping while entering or exiting. Details of the remainder section and critical section are not important here and are suppressed.

Conditions 4 and 5 mean a process can only change its state based on the value read from a variable, and cannot make any choices otherwise. A variable may only change value when a process writes to it (condition 6), and the new value depends only on the state of the writing process. No variable may change value except when some process performs a write to that variable (condition 7).

2.4  Mutual  Exclusion

A system S satisfies mutual exclusion if for all reachable states q ∈ Q, X_i(q) ∈ C and X_j(q) ∈ C imply i = j. A system S is deadlock free if for all reachable states q ∈ Q and every admissible schedule h, for some prefix h' of h either X_i(φ(q, h')) ∈ C for some process P_i or X_i(φ(q, h')) ∈ R for all i. In particular, schedules involving only processes already in the protocol can continue to make progress. A system S is lockout-free if for all reachable states q ∈ Q and all admissible schedules h from q, for all processes P_i, X_i(q) ∉ R implies there is a finite prefix h' of h such that X_i(φ(q, h')) ∈ R. This prevents a process from being tied up indefinitely while trying to enter or exit the critical section. Conditions 2 and 3 above imply that a process in its trying region must go through the critical region.

3  Algorithm

The algorithm I present is a variation of algorithms by Gary Peterson [4] and Lamport [3], with additional variables to assist in finding any process without having to check every process individually. These algorithms work by having processes first try to enter the critical section by trying a “fast protocol” that uses the minimum number of operations but assumes there is only one process active. If other processes are also trying to enter the critical section, then a “slow protocol” is entered that uses a more conventional mutual exclusion algorithm to control access to the critical section.

In this algorithm, each process has an assigned variable W_i that takes one of three values: FAST, SLOW and OUT. A process sets W_i to FAST to indicate it is in the fast version of the mutual exclusion protocol. More accurately, FAST means it has not announced it is finished or switching to the slow route. SLOW means the process is trying to enter the critical section, and has already decided that it cannot use the fast protocol due to contention from other processes. OUT means the process has exited the protocol, and is not currently attempting to enter the critical section. Any process P_i that is in its remainder section will have W_i = OUT.

Between the individual variables W_k and the global variables that all processes write, there are l − 1 levels of variables intended to help processes in determining who their competitors are. Assume we have n = k^l total processes (if n < k^l, extra ‘dummy’ processes can be included that only stay in their remainder section). These variables will be called S_{j,sub(i,j)}, where j is the current level (1 to l − 1), and sub(i, j) = ⌊i/k^j⌋ (assuming processes are numbered 0 to n − 1). Each variable in level 1 can be written by any of a group of k processes, each variable in level 2 has an associated group of k^2 processes, and each variable in level m has k^m processes that can write it. At level m, a process i is in group ⌊i/k^m⌋.
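
The grouping can be made concrete with a short Python sketch. The helper `group_members` is a hypothetical addition for illustration, and passing k as an explicit argument is a choice of this sketch; only the formula sub(i, j) = ⌊i/k^j⌋ comes from the text.

```python
def sub(i, j, k):
    """Group index of process i at level j: floor(i / k**j)."""
    return i // k ** j

def group_members(g, j, k):
    """Processes belonging to group g at level j (hypothetical helper)."""
    return range(g * k ** j, (g + 1) * k ** j)
```

With k = 3 and l = 3 (n = 27 processes), process 14 belongs to group 4 at level 1, which contains processes 12 to 14, and to group 1 at level 2, which contains processes 9 to 17.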

These variables are written as a process is entering the protocol, and each process points each variable to itself. These may be overwritten by later processes, but not in a way that can hide a fast route process (proof later). The initial values are not important, and can point to any process in the group.

This method of grouping (k, k^2, k^3, ...) in successive levels is also used in the slow mutual exclusion algorithm. To decide the winner of the slow mutual exclusion path, processes first do regular mutual exclusion within their group of k. In level 2, sets of k winners (the ones using each S_{2,sub(i,2)}) compete among themselves to choose one of k^2 processes to proceed to level 3. This process (choosing one of k processes) continues for each of the l levels, until one overall winner remains.

Lemma 3.1 The first violation of mutual exclusion cannot be due to two processes that both took the fast route into the critical section.

Proof: Suppose two processes P_i and P_j were both simultaneously in the critical section, and both used the fast route to get there. Also suppose P_i was the first of the two to set Turn (Turn := i). For P_i to take the fast route into the critical section, P_i must have made the check ‘if Turn = i’ before P_j changed Turn. However, P_i sets Lock := true as part of entering the critical section. For P_j to continue in the fast route, it must find Lock to be false. Remember that a process only clears Lock when exiting the critical section. If P_i cleared Lock, then it has finished, and P_j can safely enter the critical section. Any third process P_k clearing Lock would make a potential violation of mutual exclusion (P_i and P_k) prior to the first one to occur, a contradiction. Therefore P_j will find Lock is true, and will turn to the slow route. □

W_i := Fast;
Turn := i;
if Lock then goto Aside;
for j := 1 to l − 1 do
    S_{j,sub(i,j)} := i;
Lock := true;
if Turn ≠ i or Block then goto Aside;
C. S.
Lock := false;
W_i := Out;

Aside: W_i := Slow;
for j := 0 to l − 1 do
    Entry( j, sub(i,j+1), sub(i,j) );
Block := true;
Check_vars( l−1, 0 );
C. S.
Lock := false;
Block := false;
W_i := Out;
for j := l − 1 downto 0 do
    Exit( j, sub(i,j+1), sub(i,j) );

Procedure Check_vars( Lev, Start )
begin
    if Lev > 0 then
        for j := 0 to k − 1 do
            wait until W_{S_{Lev,Start+j}} ≠ Fast
            if W_{S_{Lev,Start+j}} = Slow then
                Check_vars( Lev−1, (Start+j)*k )
    else
        for j := 0 to k − 1 do
            wait until W_{Start+j} ≠ Fast
end

Procedure Entry( level, Vars, MY_id )
    Do normal mutual exclusion entry as P_{MY_id} using variables V_{level,Vars}

Procedure Exit( level, Vars, MY_id )
    Do normal mutual exclusion exit as P_{MY_id} using variables V_{level,Vars}

Figure 1: Improved Fast Mutual Exclusion Algorithm

Lemma 3.2 A process P_i cannot be the first to enter the critical section by the slow route if there is a process P_j capable of entering by the fast route.

Proof:

•  Case 1: P_j has already written its ID to S_{l−1,sub(j,l−1)}. Consider the current value of S_{l−1,sub(j,l−1)}. If S_{l−1,sub(j,l−1)} = j, then P_i will see W_j = FAST, and waits inside of Check_vars until W_j changes to SLOW or OUT. If S_{l−1,sub(j,l−1)} = t with t ≠ j, then if W_t = FAST, P_i still waits, and if W_t = SLOW, P_i will examine the previous (level l − 2) set of variables S_{l−2,*} for all processes that might have been hidden by P_t. By a similar argument at level l − 2, either P_j is visible at this level, or further checking is required. Eventually individual variables will be reached, so P_i will wait for P_j, or for another process.

The variable W_t can only have the value OUT if P_t has already exited the critical section. Process P_t would have had to have written S_{l−1,sub(j,l−1)} after P_j, and by Lemma 3.1 it cannot have taken the fast route, and if P_t took the slow route then P_i would not have been the first to be in this situation.

•  Case 2: P_j has not written S_{l−1,sub(j,l−1)}. In this case, P_i may not discover P_j, but since P_i sets Block to true before doing the check, P_j cannot enter the critical section by the fast route because of Block. After P_i exits, P_j will again be capable of entering by the fast path, but only after P_i has exited the critical section.

In either case, there cannot be a fast path process in the critical section along with a slow path process. □

Lemma 3.3 If the mutual exclusion routine that is part of Entry/Exit does not have deadlock or lockout, then this algorithm does not either.

Proof: Since a process that is still trying the fast route never waits for anything, clearly such a process cannot be waiting indefinitely. Therefore any process that is waiting indefinitely must be doing so in the slow route. A slow route process first sets W_i to SLOW and then proceeds through a finite set of mutual exclusion algorithms. By assumption, a process will eventually exit each individual level of the mutual exclusion algorithm. These levels do not interact since each has a separate set of variables, and each level is in the ‘critical section’ of the next outer level. Once a slow route winner has been chosen, it sets Block to true. Therefore any P_j that has not yet started attempting to enter the critical section will find Block is true. So P_j will choose the slow route and be stopped by the mutual exclusion algorithm. Therefore only a finite number of processes can enter the critical section after Block was set to true, and the slow route process that set Block can enter the critical section. □

Please note that an arbitrary number of fast route processes can pass through the critical section between two successive slow route processes. This can be prevented by having each process check for other slow route processes before clearing Block (unless the Entry/Exit routine allows it). However, an algorithm that has a first-in-first-out (or similar) property loses that property even among the slow route processes.

Lemma 3.4 Not counting operations done as part of a “wait until” loop, at most O(min(n, tlk)) operations are required for a slow route process.

Proof: A count of the variables shows there are O(n) variables, each of which is read or written a constant number of times (excepting the wait until loops). We can see this by noting there are n individual variables, k * O(n/k) = O(n) variables for the first level of mutual exclusion routines, k * O(n/k^2) variables in the second round, down to O(k) in the final round, plus a constant number of other variables. Each variable is used a constant number of times (excluding waiting loops). Adding these together shows the O(n) part of the bound.

To show the O(tlk) bound, the main program uses a constant number of operations, and the Entry procedure can be executed l times for O(k) operations each time (assuming an O(n) operation bound for n-process mutual exclusion). The Check_vars procedure does O(k) operations each time it is called. There is one initial call, then if W_t = SLOW Check_vars can be called once for each level l. At most t processes are contending for the critical section so any other process will show W_t = OUT. Therefore a process can call Check_vars at most O(tl) times, for a total operation bound of O(tlk). □

Theorem  3.1  The  algorithm  in  Figure  1 
maintains  mutual  exclusion  with  1+7  fast  route 
operations  and  0{tlk)  slow  route  operations  (ex¬ 
cluding  waits) 

Proof:  Mutual  exclusion  is  assured  by  the  com¬ 
bination  of  Lemmas  3.1,  3.2  and  the  correctness 
of  the  mutual  exclusion  algorithm  embedded  in 
the procedure Entry. The fast route operation
count is a simple count of operations (8 fixed,
$l - 1$ additional), and the slow route count is
from Lemma 3.4. □

It  should  be  noted  that  the  correctness  of  this 
algorithm  does  not  depend  on  all  of  the  groups 
being of the same size. Therefore the size and
members  of  each  group  can  be  varied  to  improve 
average  time  if  not  all  processes  are  expected 
to  attempt  to  enter  the  critical  section  equally 
often. 

Conjecture 3.1 $O(tlk)$ is a lower bound for
this problem.

My  reasoning  for  this  lower  bound  is  that  for 
programs  of  this  type,  failing  to  check  all  of 
the  earlier  variables  would  allow  a  process 
Pi  to  be  ‘hidden’  by  carefully  placed  other  pro¬ 
cesses  that  overwrite  key  variables,  so  an  incom¬ 
ing  process  fails  to  notice  Pi.  This  failure  allows 
Pi  to  enter  by  the  fast  route  while  another  pro¬ 
cess  has  entered  by  the  slow  route.  Allowing  the 
variables  to  be  used  in  a  more  general  fashion 
takes  a  messy  situation  and  confuses  it  to  the 
point  a  proof  has  proven  elusive. 


4  Fast  Election 

In  considering  the  problem  of  election,  shorter 
and  simpler  algorithms  can  be  found  because 
processes  cannot  exit  and  attempt  to  enter  the 
critical  section  again.  The  algorithm  and  proof 
are based on the algorithm for symmetric election
in Styer and Peterson [5]. This algorithm
will use 5 variables and 8 operations in the
absence of contention. If $t$ processes are
contending, $2\lceil \log t \rceil + 3$ variables are used. If
almost every process ($t$ approximately equal to $n$)
contends for election, a second upper bound of
$2\lceil \log n \rceil - 3$ variables also applies. The number
of operations (again excluding waiting loops)
will be $O(\log t)$.

This  algorithm  is  symmetric.  A  symmetric 
algorithm  is  one  where  the  only  difference  be¬ 
tween  processes  is  the  presence  of  an  identifier 
that can be compared for equality. The ‘variable’
me will hold that identifier. Every process
has exactly the same program, and cannot examine
two identifiers except to check whether they are
equal. To phrase this another way, in a symmetric
system we can exchange any two processes
in a schedule without any third process
knowing the difference. Variables can hold an
identifier or any of a fixed set of constants. If
a variable can hold any identifier, it must be able
to hold them all. Each variable must start with a
value that is a constant (typically 0).

To make this more formal, define $X_r(q) \equiv X_s(q')$
to mean that two process states $X_r(q)$ and
$X_s(q')$ are identical except that wherever $P_r$ has
the identifier of $P_i$, $P_s$ has $P_j$'s identifier,
and vice versa. Then for any schedule $h$, create
the schedule $h'$ by swapping all occurrences of $i$
and $j$. Symmetry then requires that the results
of schedules $h$ and $h'$ be the same except
for $i$ and $j$:

• For any $r$ not equal to $i$ or $j$, $X_r(\phi(q_0, h)) \equiv X_r(\phi(q_0, h'))$.

• $X_i(\phi(q_0, h)) \equiv X_j(\phi(q_0, h'))$.

• $X_j(\phi(q_0, h)) \equiv X_i(\phi(q_0, h'))$.

• If $v_m(\phi(q_0, h)) = ID_i$ then $v_m(\phi(q_0, h')) = ID_j$.

• If $v_m(\phi(q_0, h)) = ID_j$ then $v_m(\phi(q_0, h')) = ID_i$.

• If $v_m(\phi(q_0, h))$ is not $ID_i$ or $ID_j$, then $v_m(\phi(q_0, h')) = v_m(\phi(q_0, h))$.

See Burns [1] and Styer and Peterson [5] for
a more formal definition.
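The swap condition can be checked mechanically on a toy example. The sketch below is not the paper's formalism: it assumes a two-step program (`turn := me`, then read `turn`), and the names `simulate`, `swap` and `swap_state` are hypothetical helpers introduced only for this illustration. It runs a schedule and its swapped counterpart, then compares the resulting states.

```python
def simulate(schedule):
    """Run the toy program `turn := me; seen := turn` under a schedule of process ids."""
    turn = 0                                  # shared variable starts at a constant
    state = {}                                # pid -> (program counter, value last read)
    for pid in schedule:
        pc, seen = state.get(pid, (0, None))
        if pc == 0:
            turn, pc = pid, 1                 # step 1: write own identifier
        elif pc == 1:
            seen, pc = turn, 2                # step 2: read the shared variable
        state[pid] = (pc, seen)               # later steps leave the state unchanged
    return turn, state


def swap(x, i, j):
    """Exchange identifiers i and j; leave every other value alone."""
    return j if x == i else i if x == j else x


def swap_state(s, i, j):
    """Apply the identifier swap to a local state (program counter, value read)."""
    pc, seen = s
    return (pc, swap(seen, i, j))


i, j = 1, 2
h = [1, 2, 1, 3, 2, 3]
h_prime = [swap(p, i, j) for p in h]          # schedule with i and j exchanged
turn_h, X = simulate(h)
turn_hp, Xp = simulate(h_prime)
print(X[i], Xp[j])                            # equal up to exchanging ids 1 and 2
```

Because the toy program only writes and copies identifiers and never orders them, exchanging two processes in the schedule exchanges their final states and leaves the third process unaffected.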

In the algorithm given in Figure 2, each variable
starts out with the value 0. The process
that last wrote turn attempts to enter the
critical section. Other processes give priority to
the newcomer, and erase any writes they have
made. The newcomer waits for each of the variables
$V_i$ to be 0, and then writes its own ID
into the variable. If it ever notices that turn
no longer equals its own ID, it resets any $V_i$ still
holding its ID back to 0. The processes use a
parallel set of variables $C_i$ to record that contention
is known to have occurred at this level (two
processes simultaneously at or beyond this level).
If some $V_i$ has been overwritten by some other
process, it is left alone for that other process to
handle. The first process can easily enter the
critical section. A second process can enter the
critical section only with the help of a process
already at that level (which turns out not to help
at all), or itself and one other process at the
previous level, which would require $2^{\log n} = n$
processes, but only $n - 1$ are available. We also
have a shortcut to election for when only a few
processes are active. It will be proved that if
level $l$ is reached with no contention visible at
level $l - 2$, then that process is far enough
ahead to safely declare itself elected.

The formal proof requires an accounting system
of credits to prove mutual exclusion, where
particular  process  and  variable  states  are  as¬ 
signed  a  value  in  terms  of  these  credits.  To 
progress  to  a  given  point  within  the  algorithm,  a 
process  must  collect  a  specified  number  of  cred¬ 
its.  Then  it  is  shown  that  there  are  not  enough 
credits  available  for  two  processes  to  simultane¬ 
ously  declare  themselves  elected. 


turn := me
for level := 1 to ⌈log n⌉ do
    wait until V_level = 0
    V_level := me
    if turn ≠ me then
        for j := 1 to level do
            if V_j = me then
                V_j := 0
            else
                C_j := 1
        Halt
    if C_{level-2} = 0 then
        Announce Elected
Announce Elected

Figure 2: $\lceil \log n \rceil + 1$-Variable Symmetric Election
Algorithm
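To make the control flow of the figure concrete, the following is a minimal simulation sketch, not the authors' implementation. It rests on several labeled assumptions: each read or write of a shared variable is one atomic step, process IDs are nonzero integers, the contention flag written while backing off is `C[j]`, and `C[level-2]` is consulted only when `level >= 3` (the figure does not say how the test behaves at levels 1 and 2). The names `make_process` and `run` are hypothetical.

```python
import math


def make_process(me, shared, levels):
    """One process of the election sketch; every `yield` follows exactly one shared access."""
    V, C = shared["V"], shared["C"]
    shared["turn"] = me                       # turn := me
    yield
    for level in range(1, levels + 1):
        while True:                           # wait until V[level] = 0
            seen = V[level]
            yield
            if seen == 0:
                break
        V[level] = me                         # V[level] := me
        yield
        current = shared["turn"]              # test: turn != me ?
        yield
        if current != me:                     # a newcomer has arrived: back off
            for j in range(1, level + 1):
                holds_mine = V[j] == me
                yield
                if holds_mine:
                    V[j] = 0                  # erase our own write
                else:
                    C[j] = 1                  # ASSUMPTION: contention recorded in C[j]
                yield
            return "halted"
        if level >= 3:                        # ASSUMPTION: C[level-2] exists only from level 3
            no_contention = C[level - 2] == 0
            yield
            if no_contention:
                return "elected"              # quick exit
    return "elected"                          # normal exit after the last level


def run(n, schedule):
    """Advance processes one shared access at a time, in the order given by `schedule`."""
    levels = max(1, math.ceil(math.log2(n)))
    shared = {"turn": 0, "V": [0] * (levels + 1), "C": [0] * (levels + 1)}
    procs, outcome = {}, {}
    for pid in schedule:
        if pid in outcome:
            continue                          # this process has already finished
        if pid not in procs:
            procs[pid] = make_process(pid, shared, levels)
        try:
            next(procs[pid])
        except StopIteration as done:
            outcome[pid] = done.value
    return outcome


print(run(8, [1] * 20))              # a lone process
print(run(8, [1] * 20 + [2] * 20))   # process 2 arrives after process 1 has finished
print(run(2, [1, 2] * 8))            # two processes in lockstep
```

In the second run, process 2 never appears in the result because it spins on `V[1]`, which the elected process never clears. This is consistent with election being a one-time contest: the absence of deadlock means that some process makes progress, not that every loser terminates.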

Lemma  4.1  The  symmetric  election  algorithm 
in  Figure  2  has  no  deadlock. 

Proof: Suppose the system is deadlocked.
Since each process only writes turn once while
entering, there must be a last process to write
turn, say process $P_i$. Since $P_i$ cannot make
progress, it must be stuck waiting for $V_l = 0$ for
some $l$. Each process can write $V_j$ at most twice
(once to me and once to 0), so some process
must have been the last to set $V_l$. At this point,
consider the process with the highest value of
level that has not noticed turn $\neq$ me. A process
clears all the $V$'s holding its ID as soon as
it notices turn $\neq$ me, so such a process (call it
$P_j$) must exist. $P_j$ cannot be blocked since there
are no higher processes, so it will be able to see
turn $\neq$ me and set any variables it has written
to 0. Repeat the argument with the new highest
process until $P_i$ can proceed. Therefore, this
system does not have deadlock. □

Proving that only one process can be elected
is not so straightforward, especially as the normal
technique of showing each stage eliminates
half the processes remaining does not work. Indeed,
it is possible to arrange a schedule so that
every process sets every variable, reaching the
final test of turn before discovering another process
has just changed it. The fact that a process
can clear several variables, as happens above,
would suggest that an exiting process can provide
too much help to other processes, and allow
a violation of mutual exclusion. In the next
lemma we see that this can only happen in limited
situations. In particular, it can only happen
in place of continuing to the next level.
Therefore one process clearing several variables
does not cause the processes to violate mutual
exclusion.

Each period during which $V_{level} = 0$, before it is
assigned a nonzero value, will be called a window. It is
only during a window that a process can get
past ‘while $V_{level} \neq 0$’ and continue to the next
level or abort and let others continue. One or
more processes may see this window and proceed
before any of them sets $V_{level}$ := me, but
the following lemma limits what can happen
afterward.

Lemma 4.2 For any window on $V_{level}$, at most
one process can change any variable other than
$V_{level}$.

Proof: Suppose a set of processes $P_1, P_2, \ldots, P_k$
all see $V_{level} = 0$ and get past ‘while $V_{level} \neq 0$’
before the window closes. One of these (say
$P_1$) must have been the last to set turn. If
level = 1, every other process will see turn $\neq$
me, possibly clear $V_1$, and exit. Therefore assume
level > 1. Since all of the other processes
saw turn = me at the previous level,
each of them was beyond the ‘if turn $\neq$ me’ test
when $P_1$ executed turn := me. But $P_1$ reached
‘while $V_{level} \neq 0$’ before any of the others could
execute $V_{level}$ := me, writing $V_1, V_2, \ldots, V_{level-1}$
in the process. When the other processes check
‘if turn $\neq$ me’, it will be true, so these processes
will clear any variable holding their ID.
But in checking $V_1, V_2, \ldots, V_{level-1}$, their ID has
been overwritten (and is never restored), so
these processes can only execute $V_j$ := 0 for
$j = level$. $P_1$ may continue to the next level or
notice turn $\neq$ me and clear some or all of the
$V_j$'s, but it is the only process that can change any
variable other than $V_{level}$.

Corollary 1 For each $V_{level}$ := 0, at most one
process can go to the next level or clear multiple
$V_j$'s.

Lemma 4.3 If there are two or more processes
simultaneously at or beyond level $l$, then before
the second process arrives the variable $C_{l-2}$ will
have already been set to 1.

Proof: Consider what events can take place
at level $l - 1$ without $C_{l-2}$ being set. By
Lemma 4.2, only one process in any window can
continue or find $V_{l-2}$ = me. So for any window
on $V_{l-2}$ either this process is alone, or any other
processes have not yet started clearing $V$'s. If a
process continues, then $V_{l-2}$ prevents other processes
from continuing, and if it stops, then it
leaves the algorithm. Either way, there cannot
be two processes at level $l$ (but there can be at
level $l - 1$). Therefore if there are two processes
at level $l$, there must have been at least
one process $P_j$ at level $l - 1$ which cannot set
$V_{l-1}$ back to 0. This allows $P_j$ to set $C_{l-2}$ to 1,
and change $V_{l-1}$. □

This  next  lemma  is  critical  in  proving  mutual 
exclusion  when  all  N  processes  are  active.  It 
is  also  used  to  show  the  number  of  variables 
used  (but  not  correctness)  when  some  process 
declares itself elected before the top level is
reached.

This lemma introduces the idea of ‘credits’,
which are used here to represent how much
progress a given process has made. In order
to make progress, other processes had to quit
and set various variables $V_j$ back to 0. Each
credit will stand for one incoming process, or
its equivalent in terms of an initial 0 in some
variable.

Lemma 4.4 A process must have $2^{l-1}$ credits to
reach wait until $V_l = 0$, and credits are never
created.

Proof: To show that mutual exclusion holds,
an accounting system of credits will be set up
where a process must collect $2^{level-1}$ credits to
reach ‘while $V_{level} \neq 0$’. Every set of actions by
one or more processes will maintain the number
of credits available, or reduce the number of
available credits. Reaching the critical section
is equivalent to reaching level $\log n + 1$, and
requires $n$ credits. Each process starts out with
1 credit, since every process can reach ‘while
$V_1 \neq 0$’, for $n$ credits held by processes. We
assign each variable $V_{level}$ with $V_{level} = 0$ a value
of $2^{level-1}$ credits. The initial 0 values for these
variables equal an additional $n - 1$ credits, for a
total of $2n - 1$ credits at the beginning of the
algorithm.

Now consider all possible operations by
processes:

• A process may fail to execute $V_j$ := 0
when it is otherwise capable of doing so.
This can happen when a process is slow
to check $V_j$, and its ID is overwritten in
the meantime. If this happens, the credits
that would have been transferred to $V_j$ are
lost. Similarly, if a process writes 0 into
a variable already holding 0, those credits
are lost.

• Let one or more processes notice $V_{level} = 0$,
and let one of them continue to the next
level. By Lemma 4.2, the processes that
don't go to the next level can only execute
$V_{level}$ := 0, transferring to it the $2^{level-1}$
credits they have by getting this far. Then
take the $2^{level-1}$ credits from the initial 0
value of the variable, and assign them to
the process that continues. This gives it
the $2^{level}$ credits it needs to continue
onward.

• Let one or more processes notice $V_{level} = 0$,
and let one of them clear some or all
of the $V_j$'s. Again by Lemma 4.2, at
most one process $P_i$ can clear any $V_j$ other
than $V_{level}$. As above, the remaining processes
can transfer their credits by setting
$V_{level}$ := 0. Again we will assign the credits
held by $V_{level}$ to $P_i$. $P_i$ has $2^{level}$ credits
available, half from $V_{level}$ and half from
reaching ‘while $V_{level} \neq 0$’. Clearing every
variable $V_1$ to $V_{level}$ accounts for $2^{level} - 1$
credits, losing one (or more) credits. So
no credits can be created by a process that
clears multiple variables.

The  above  cases  account  for  all  possible  ac¬ 
tions  by  processes,  so  we  see  that  although  cred¬ 
its  can  be  lost,  they  cannot  be  created.  □ 

Lemma 4.5 If a process declares itself elected
(quick method), it will be unique.

Proof: Suppose $P_i$ declares itself elected at
level $l$. By Lemma 4.3 there cannot be another
process at this level since $P_i$ found $C_{l-2}$
clear. No processes can get past level $l$ in the
future since $P_i$ has set $V_l$ to a nonzero value,
and $V_l$ is never cleared back to zero. This prevents
any other process from proceeding past
wait until $V_l = 0$. Also no other process $P_j$ can
use the quick exit for election, since this same
argument works when we exchange $P_i$ and $P_j$.
If $P_j$ could declare itself elected, then $P_i$ cannot
have reached its current position. Therefore if
any process $P_i$ finds $C_{l-2}$ clear, it can declare
itself elected safely. □

Theorem 4.1 The symmetric election algorithm
in Figure 2 maintains mutual exclusion,
and uses $\min(2\lceil \log t \rceil + 3,\ 2\lceil \log n \rceil - 3)$ variables if
$t \le n$ processes participate, and $O(\log t)$
operations (excluding wait loops).

Proof: Lemma 4.5 proves mutual exclusion
when a process uses the quick exit. Otherwise,
Lemma 4.4 shows that, treating the critical section
as level $\lceil \log n \rceil + 1$, at least $n$ credits are
necessary for a process to reach the critical section.
If two processes were to reach the critical
section, $2n$ credits would be necessary. Counting
the $n$ credits held by processes and $n - 1$
credits from the initial 0 values for the $V_j$ variables,
there are $2n - 1$ initial credits. However,
$n$ credits are required before a process can enter
the critical section. So we see that there are not
enough credits available for two processes to be
in the critical section at the same time, proving
mutual exclusion.
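The credit count in this argument can be checked with one line of arithmetic. The display below assumes, for simplicity, that $n$ is a power of two so that $\lceil \log n \rceil = \log_2 n$; the summation index $level$ follows the proof of Lemma 4.4.

```latex
\[
\underbrace{n}_{\text{one credit per process}}
\;+\;
\underbrace{\sum_{level=1}^{\log_2 n} 2^{\,level-1}}_{\text{initial zeros in } V_1,\ldots,V_{\log n}}
\;=\; n + (n - 1) \;=\; 2n - 1 \;<\; 2n .
\]
```

Two processes in the critical section would need $n$ credits each, one more than the total that ever exists.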

As presented, there are $2\lceil \log n \rceil + 1$ variables.
Not all of these variables are necessary for the
correct and efficient behavior of the algorithm.
First, the values of $C_{\log n - 2}$ through $C_{\log n}$ are
irrelevant. The quick exit is not necessary here
since the process will take the normal exit at the
end of this level. Also, $C_{\log n - 3}$ is not necessary.
A process reading $C_{\log n - 3}$ is at level $\log n - 1$, so
it can simply take the normal exit, modifying
one more variable. The algorithm treats these
variables as always holding 1.

Next we show that $2\lceil \log t \rceil + 3$ variables are
used when only $t$ processes are active. Let $m =
\lceil \log t \rceil$. There are $2^m - 1$ credits in $V_1$ through
$V_m$ and $t \le 2^m$ credits for initial processes, so
there are at most $2 \cdot 2^m - 1$ credits of interest. Credits
stored in higher variables are not counted, since
they are either held by some process or returned
to that variable. To get two processes to level $m$
requires $2 \cdot 2^m$ credits by Lemma 4.4. Therefore
any process checking $V_m$ will always see its own
ID, and $C_m$ will remain 0. The discovery that
$C_m = 0$ will allow a process at level $m + 2$ to take
a quick exit. So only turn, $V_1$ through $V_{m+2}$ and
$C_1$ through $C_m$ will ever be referenced, proving
the variable bound.

The $O(\log t)$ operation bound is a simple consequence
of the fact that processes don't get
past level $\lceil \log t \rceil + 2$, and the variables at each level
(excluding wait loops) are accessed a maximum
of seven times. This counts the initial examination
of $V_{level}$, not subsequent checks. The
check for clearing $V_j$ is included in the cost of
level $j$, although the actual access may take
place later. □

We can modify this algorithm to use fewer
of the $C_j$ variables. For example, suppose we
expect that usually only one process will be active.
Then we can use only $C_1$, with $C_2, C_3, \ldots$
being treated as if they always held the value
1. In general, $\lceil \log t \rceil + 1$ $C$ variables need to
be defined to allow a quick exit for $t$ processes.
If more than $t$ processes show up, the $C$'s will
usually be set to one and the algorithm degenerates
to a regular election algorithm.

5  Conclusion 

In this paper, we discover that the fast mutual
exclusion gap between $O(1)$ operations if a process
is alone and $O(n)$ if it is not does not need
to be that large. The bound of $O(tlk)$ operations
($t$ is the number of contending processes,
$l$ is the number of levels, and $k^l = n$ is the
number of processes) is reachable for mutual
exclusion.
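A small numeric sketch shows how the bound of Lemma 3.4 grows with contention. It evaluates only the expression $\min(n, tlk)$ with all hidden constants dropped, so the numbers indicate growth, not actual operation counts; the function name `slow_route_bound` and the sample parameters are assumptions made for illustration.

```python
def slow_route_bound(n, k, t):
    """Value of min(n, t*l*k), where l is the smallest integer with k**l >= n."""
    levels, size = 0, 1
    while size < n:
        size *= k
        levels += 1
    return min(n, t * levels * k)


# n = 4096 processes in groups of k = 8, hence l = 4 levels
for t in (1, 10, 200):
    print(t, slow_route_bound(4096, 8, t))
```

With few contenders the expression stays far below $n$, and it saturates at $n$ once $tlk$ exceeds it.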

For one-time election, the bound of $O(\log t)$
provides a very smooth function that is directly
related to the number of contending processes,
with no jumps or discontinuities. This is the
same as the mutual exclusion algorithm if alone,
but better for small numbers of contending processes.

This paper obtains the first deterministic
sublinear-time algorithm for network decomposition
[...] an in-depth discussion and survey of all existing
definitions of network decomposition. We also
present a new reduction that efficiently transforms
a weak-diameter version of the problem
to a strong one. Thus our algorithm speeds up
all alternate notions of network decomposition.
[...] we obtain the first fast algorithm for constructing
a sparse neighborhood cover of a network,
thereby improving the distributed preprocessing
time for all-pairs shortest paths, load balancing,
[...]

1 Introduction


Sparse  neighborhood  covers.  This  pa¬ 
per  is  concerned  with  fast  deterministic  al¬ 
gorithms  for  constructing  sparse  neighbor¬ 
hood  covers  in  the  distributed  network  model. 
Given  an  undirected  (weighted)  graph,  a 
neighborhood  cover  is  a  collection  of  sets  of 
nodes  (also  called  clusters)  which  cover  the 
neighborhoods  of  all  nodes  in  the  network.  A 
high-quality  or  sparse  cover  (see  Section  2) 
is  one  that  has  an  optimal  tradeoff  between 
the  diameter  of  each  cluster  and  the  cluster 
overlap  at  single  nodes. 

The  method  of  representing  networks  by 
sparse  neighborhood  covers  has  recently  been 
identified  [13, 12, 3, 15]  as  the  key  to  the  mod¬ 
ular  design  of  efficient  network  algorithms. 
Using  this  method  as  a  basic  building  block 
leads  to  dramatic  performance  improvements 
for  several  fundamental  network  control  prob¬ 
lems  (such  as  shortest  paths  [2],  job  schedul¬ 
ing  and  load  balancing  [10],  broadcast  and 
multicast [4], deadlock prevention [9], bandwidth
management in high-speed networks
[7],  and  database  management  [15]),  as  well 
as  for  classical  problems  in  sequential  com¬ 
puting  (such  as  finding  small  edge  cuts  in 
planar  graphs  [19]  and  approximate  all-pairs 
shortest  paths  [5]).  In  most  of  these  appli¬ 
cations,  sparse  neighborhood  covers  yield  the 
first  polylogarithmic-overhead  solution  to  the 
problem. Thus, in a sense, the impact of efficient
sparse neighborhood cover algorithms
on distributed computing is analogous to the
impact of efficient data structures (like balanced
search trees or 2-3 trees) on sequential
computation.

Our results. This paper presents the first
deterministic sublinear-time distributed algorithm
(for static synchronous networks) that
constructs a high-quality network decomposition.
The algorithm runs in time $O(n^{\epsilon})$, for
any $\epsilon > 0$, improving on the best-known
deterministic running time of $O(n \log n)$ for this
problem in [13]. We additionally show how
to efficiently transform the weak-diameter
version of the problem to a strong-diameter
sparse 1-neighborhood cover (see Section 2).
We get the analogous improvement for
$t$-neighborhood covers, where all algorithmic
running times blow up by a factor of $t$. The
distributed algorithm can be adapted to a
more realistic dynamic asynchronous environment
using the existing transformer techniques
in [1, 13, 11]. The applications that
use sparse covers as a data structure often
run in time $O(t)$, where $t \ll Diam(G) \le n$,
in which case our improvement is particularly
significant.

Our  results  versus  existing  work.  In 
addition  to  obtaining  the  first  deterministic 
sublinear-time  algorithm  for  high-quality  net¬ 
work  decomposition,  we  emphasize  that  we 
additionally  produce  clusters  that  are  more 
useful  for  applications.  The  definition  of 
sparse  neighborhood  covers  considered  here 
is  equivalent  to  that  in  [13],  which  employs  a 
strong  notion  of  the  diameter  of  a  cluster  (see 
Section  2).  The  definition  in  [13]  is  related  to 
yet  distinct  from  the  notion  of  network  de¬ 
composition  defined  in  [8,  18,  16].  Network 
decomposition  as  utilized  in  [8,  18,  16,  17,  6] 
employs  only  a  weak  notion  of  low  diameter. 
This  means  that  the  network  decomposition 
clusters  might  not  even  be  connected  within 
the  clusters.  Thus  they  are  not  sufficient  to 
support,  for  example,  local  routing,  where  the 
path  between  two  nodes  in  the  same  cluster 
should  consist  entirely  of  nodes  within  that 
cluster. 

We emphasize that the new fast algorithms
in this paper speed up all alternative notions
of network decomposition, including the
sparse neighborhood cover definition needed
for all the distributed applications. (See Section 2
for precise definitions.)

Other work on related problems. The
(weak diameter, low quality) notion of “network
decomposition” was first defined in
[8, 18]. Awerbuch et al. [8] gave a fast algorithm
for obtaining $O(n^{\epsilon})$-diameter clusters,
for any $\epsilon > 0$. Their algorithm requires
$O(n^{\epsilon})$ time in the distributed case, and
$O(nE)$ sequential operations. Unfortunately,
the construction of [8] is very inefficient in
terms of the quality of the decomposition.
Roughly speaking, the inefficiency factor is
$O(n^{\epsilon})$, and this factor carries over to all but
some of the graph-theoretic applications, rendering
the decomposition of [8] absolutely unacceptable
in any practical context. The construction
of [8] is, however, sufficient for the
two main applications they cite in that paper:
the maximal independent set problem
and $(\Delta + 1)$ coloring. This is because to construct
a MIS or a $(\Delta + 1)$ coloring, one needs
to traverse the $O(n^{\epsilon})$-diameter clusters only
a constant number of times. The network
control applications, such as routing, online
tracking of mobile users, and all-pairs shortest
paths, however, need to traverse the clusters
many times. A higher-quality decomposition
is needed to avoid a large blowup in the
running time for these latter applications.

For  the  remainder  of  this  paper,  when  we 
refer  to  network  decomposition,  we  mean  any 
of  the  formulations  of  high-quality  decompo¬ 
sition,  and  not  the  large  diameter,  large  num¬ 
ber  of  colors  obtained  by  [8]. 

Subsequent to our work, Panconesi and
Srinivasan [17] slightly reduced the running
time for the poor-quality and weak-diameter
construction in [8]. While [8] obtained
a running time of $O(n^{\epsilon})$, where $\epsilon =
O(\sqrt{\log\log n}/\sqrt{\log n})$, [17] reduced $\epsilon$ to $\epsilon =
O(1/\sqrt{\log n})$. As a consequence of the better
running time achieved in [17] (smaller $\epsilon$)
and the techniques in this paper, the running
time of our algorithm for high-quality network
decomposition can be slightly improved
(see Corollary 3.4).

The randomized algorithm of Linial and
Saks [16] achieves a high-quality decomposition
by introducing randomization. The results
are stated in terms of a weak notion
of low-diameter clusters, but can be modified
to produce a strong-diameter decomposition
as well, using, for example, the reduction
techniques in this paper (or [11]). The resulting
algorithm is efficient. (We comment
that the distributed algorithm is valid only
for static synchronous networks, but the efficient
transformer techniques of [11] extend
it to a (more realistic) dynamic asynchronous
model.) Since we are concerned here with
using neighborhood covers as a data structure
and running various applications on top,
a randomized solution is not acceptable in
many cases. We need a fast deterministic algorithm
that guarantees a good underlying
neighborhood cover.

2  Definitions 

Notions  of  network  decomposition. 
We  survey  the  different  formulations  of  net¬ 
work  decomposition,  and  discuss  their  rela¬ 
tions.  Within  each  family  of  definitions,  we 
also  discuss  what  it  means  to  have  a  high- 
quality  decomposition  or  cover,  in  terms  of 
the  optimal  tradeoffs  between  low  diameter 
and  sparsity.  The  sparse  neighborhood  cover 
formulation  is  the  one  that  is  useful  for  all  the 
applications.  We  stress  that  the  algorithms 
in  this  paper  achieve  all  alternative  notions 
of  network  decomposition. 

Definition 2.1 Consider a graph $G$ whose
vertices appear in sets $S_1, \ldots, S_r$. The
weak distance between $u, v \in S_i$, denoted
$dist_G(u, v)$, is the length of the shortest path
between $u$ and $v$ in $G$. Namely, the path
is allowed to shortcut through vertices not in
$S_i$. The weak diameter of $S_i$ is $diam(S_i) =
\max_{u,v \in S_i}(dist_G(u, v))$.

Definition 2.2 Consider a graph $G$ whose
vertices appear in sets $S_1, \ldots, S_r$. The
strong distance between $u, v \in S_i$, denoted
$dist_{S_i}(u, v)$, is the length of the shortest path
between $u$ and $v$ on the induced subgraph $S_i$ of
$G$. Namely, all vertices on the path connecting
$u$ and $v$ are also in $S_i$. The strong diameter of
$S_i$ is $Diam(S_i) = \max_{u,v \in S_i}(dist_{S_i}(u, v))$.
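The difference between the two diameters shows up already on a 5-cycle. The sketch below assumes an unweighted graph stored as an adjacency dictionary, so distances are hop counts; the helper names `bfs_dist` and `diameter` are introduced here for illustration and are not from the paper.

```python
from collections import deque
from itertools import combinations


def bfs_dist(adj, src, allowed=None):
    """Hop distances from src; if `allowed` is given, paths may only use those vertices."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist and (allowed is None or w in allowed):
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter(adj, cluster, strong):
    """Weak diameter (paths anywhere in G) or strong diameter (paths inside the cluster)."""
    allowed = set(cluster) if strong else None
    worst = 0
    for u, v in combinations(cluster, 2):
        d = bfs_dist(adj, u, allowed).get(v, float("inf"))
        worst = max(worst, d)
    return worst


cycle5 = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}   # 0-1-2-3-4-0
cluster = [0, 1, 2, 3]
print(diameter(cycle5, cluster, strong=False))   # weak: 0 and 3 may shortcut through 4
print(diameter(cycle5, cluster, strong=True))    # strong: must walk 0-1-2-3
```

A cluster whose induced subgraph is disconnected gets an infinite strong diameter here, which matches the remark that weak-diameter clusters need not be connected.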

The square of a graph, $G^2$, is defined to
be the graph $G$ with an additional edge $(u, v)$
whenever there exists a $w$ such that $(u, w)$ and
$(w, v)$ are in $G$. Similarly, $G^t$ is the graph with
an edge between any two vertices that are connected
by a path of length $\le t$ in $G$. The
$j$-neighborhood of a vertex $v \in V$ is defined as
$N_j(v) = \{w \mid dist_G(w, v) \le j\}$. Similarly, the
$j$-neighborhood of a set $W$ is defined to be
$N_j(W) = \bigcup_{v \in W} N_j(v)$.
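These three notions are easy to compute with a depth-limited breadth-first search. The following sketch again assumes an unweighted adjacency-dictionary graph; `neighborhood`, `set_neighborhood` and `graph_power` are hypothetical helper names.

```python
from collections import deque


def neighborhood(adj, v, j):
    """N_j(v): all vertices within hop distance j of v (v itself included)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == j:
            continue                          # do not expand beyond distance j
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def set_neighborhood(adj, vertices, j):
    """N_j(W): union of the j-neighborhoods of the vertices in W."""
    out = set()
    for v in vertices:
        out |= neighborhood(adj, v, j)
    return out


def graph_power(adj, t):
    """G^t: u and v are adjacent when 1 <= dist_G(u, v) <= t."""
    return {v: sorted(neighborhood(adj, v, t) - {v}) for v in adj}


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}   # path 0-1-2-3-4
print(neighborhood(path, 0, 2))
print(set_neighborhood(path, [0, 4], 1))
print(graph_power(path, 2)[2])
```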

We  are  now  ready  to  define  the  alternate 
notions  of  network  decomposition.  First,  we 
give  the  weak  diameter  definition. 

Definition 2.3 For an undirected graph $G =
(V, E)$, a $(\chi, d, \lambda)$-decomposition is defined to
be a $\chi$-coloring of the nodes of the graph that
satisfies the following properties:

1. each color class is partitioned into an arbitrary
number of disjoint clusters;

2. the weak diameter of any cluster of a single
color class is at most $d$;

3. clusters of the same color are at least distance
$\lambda + 1$ apart.

A $(\chi, d, \lambda)$-decomposition is said to be
high-quality if, when $d = O(k\lambda)$, $\chi$ is at most $kn^{1/k}$.
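The three properties can be turned into a direct checker. This is a brute-force sketch for small unweighted graphs, not an algorithm from the paper; the input format (a dictionary from color to a list of vertex sets) and the names `hop_distances` and `is_decomposition` are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def hop_distances(adj, src):
    """Hop distances in G from src to every reachable vertex."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_decomposition(adj, clusters, d, lam):
    """Check properties 1-3 for `clusters`: dict color -> list of vertex sets."""
    all_clusters = [c for group in clusters.values() for c in group]
    covered = sorted(v for c in all_clusters for v in c)
    if covered != sorted(adj):                # property 1: every vertex in exactly one cluster
        return False
    dist = {v: hop_distances(adj, v) for v in adj}
    inf = float("inf")
    for group in clusters.values():
        for c in group:                       # property 2: weak diameter at most d
            if any(dist[u].get(v, inf) > d for u, v in combinations(c, 2)):
                return False
        for a, b in combinations(group, 2):   # property 3: same-color clusters >= lam + 1 apart
            if any(dist[u].get(v, inf) <= lam for u in a for v in b):
                return False
    return True


path6 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
two_colors = {"red": [{0, 1}, {4, 5}], "blue": [{2, 3}]}
one_color = {"red": [{0, 1}, {2, 3}, {4, 5}]}
print(is_decomposition(path6, two_colors, d=1, lam=1))
print(is_decomposition(path6, one_color, d=1, lam=1))
```

On the 6-node path, two colors suffice for separation $\lambda = 1$, while a single color fails because adjacent clusters are only distance 1 apart.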

We make several remarks about
Definition 2.3, which is equivalent to the definitions
in [8, 16, 17].

• The high-quality decomposition as defined
above achieves the optimal tradeoff;
there are graphs for which $\chi$ must be
$\Omega(kn^{1/k})$ to achieve a decomposition into
clusters of diameter bounded by $O(k\lambda)$
and separation $\lambda$ [16].

• When $\lambda = 1$, we will abbreviate this as a
$(\chi, d)$-decomposition. Typically, we are
most concerned with the case of a high-quality
decomposition when $\chi$ and $d$ are
both $O(\log n)$. This optimal decomposition
tradeoff is not achieved in the clusters
of [8, 17], but is achieved by randomized
methods in [16]. The algorithms in
this paper are the first to achieve the optimal
tradeoff deterministically in sublinear
time.

• The main application known for this
structure is “symmetry-breaking”: it
can be used to construct a maximal independent
set or a $(\Delta + 1)$ coloring fast in the
distributed domain [8, 16, 17]. In this
paper, we use the structure for symmetry
breaking as follows: the recursive algorithm
in Section 3 constructs a $(\chi, d, \lambda)$-decomposition
on the power of the graph
inside the recursion first, and later uses
this to obtain a strong-diameter decomposition.

For strong-diameter network decomposition, the definition is the
same as Definition 2.3, except that in Property 2 strong
diameter is substituted for weak diameter. As with the weak
definition, the "high-quality" tradeoffs are optimal. A
strong-diameter $(\chi, d)$-decomposition can be thought of as a
generalization of the standard graph coloring problem, where
$\chi$ is the number of colors used, and the clusters are
supernodes of diameter $d$.

We  now  present  the  definition  for  sparse 
neighborhood  covers.  Notice  that  this  is  a 
strong  diameter  definition. 

Definition 2.4 A $(k, t)$-neighborhood cover is a collection of
sets (also called clusters) of nodes $S_1, \ldots, S_r$ with the
following properties:

1. $\forall v, \exists i$ s.t. $N_t(v) \subseteq S_i$, where
$N_t(v) = \{u \mid dist_G(u, v) \le t\}$.

2. $\forall i$, $Diam(S_i) \le O(kt)$, where
$Diam(S_i) = \max_{u, w \in S_i} dist_{S_i}(u, w)$.

A $(k, t)$-neighborhood cover is said to be sparse if each node
is in at most $kn^{1/k}$ sets.

Setting $k = 1$, the set of all balls of radius $t$ around each
node is a sparse neighborhood cover. Setting $k = Diam(G)/t$,
the graph $G$ itself is a sparse neighborhood cover. In the
first case, the diameter of a ball is $O(t)$, but each node
could appear in every ball. In the second case, each node
appears only in $G$, but the diameter of $G$ could be as high as
$n$. Setting $k = \log n$ (the typical and useful setting for
all the applications), a sparse $(\log n, t)$-neighborhood cover
is a collection of sets $S_i$ with the following properties: the
sets contain all $t$-neighborhoods, the diameter of the sets is
bounded by $O(t \log n)$, and each node is contained in at most
$c \log n$ sets, where $c > 0$. We remark that this bound is
tight to within a constant factor; there exist graphs for which
any $(\log n, t)$-neighborhood cover places some node in at
least $\Omega(\log n)$ sets [16]. When $k = \log n$, we find
that sparse neighborhood covers form a useful data structure to
locally represent the $t$-neighborhoods of a graph.
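
The $k = 1$ extreme mentioned above can be made concrete with a small centralized sketch. It is only an illustration, not the distributed algorithm of this paper: the adjacency-dictionary representation and the helper names `neighborhood`, `ball_cover` and `max_membership` are assumptions introduced here.

```python
from collections import deque


def neighborhood(adj, v, j):
    """Return N_j(v): all nodes within distance j of v, found by BFS."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == j:
            continue  # do not expand beyond radius j
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def ball_cover(adj, t):
    """The k = 1 cover: one ball of radius t around every node."""
    return {v: neighborhood(adj, v, t) for v in adj}


def max_membership(cover):
    """Largest number of cover sets that any single node belongs to."""
    count = {}
    for cluster in cover.values():
        for u in cluster:
            count[u] = count.get(u, 0) + 1
    return max(count.values())


# Hypothetical example graph: the path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
cover = ball_cover(path, 1)
```

Every $t$-neighborhood is trivially contained in a cover set here, so the only quantity of interest is `max_membership(cover)`, which can grow with the ball sizes; that is the cost the sparse covers are designed to bound.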

Our  new  fast  distributed  algo¬ 
rithm  achieves  deterministically  a  structure 
which  is  simultaneously  a  (strong,  and  there¬ 
fore  also  weak)  diameter  decomposition  and 
a  sparse  neighborhood  cover. 

3  Weak  Diameter  Network 
Decomposition 

In this section, we introduce the new distributed algorithm
Color, which recursively builds up a
$(kn^{1/k}, 2k, 1)$-decomposition. It calls on a procedure,
Create_New_Color, which runs a modified version of the
Awerbuch-Peleg [14] greedy algorithm on separate clusters.

Note  that  all  distances  in  the  discussion  be¬ 
low,  including  those  in  the  same  cluster,  are 
assumed  to  be  weak  distances,  and  the  diam¬ 
eter  of  the  clusters  is  always  in  terms  of  weak 
diameter  (see  Section  2). 

Color implicitly takes higher and higher powers of the graph;
recall that we define the graph $G^t$ to be the graph in which
an edge is added between any pair of nodes that have a path of
length $\le t$ in $G$. Notice that, since the only edges in the
distributed network are still the edges of the underlying graph
$G$, looking at all our neighbors in the graph $G^t$ might
require traversing paths of length $t$. Therefore the time for
running an algorithm on the graph $G^t$ blows up by a factor of
$t$. The crucial observation is that a
$(\chi, d, 1)$-decomposition on $G^t$ is a
$(\chi, dt, t)$-decomposition on $G$. Choosing $t$ well at the
top level of the recursion guarantees that nodes in different
clusters of the same color are always separated by at least
twice the maximum possible distance of their radii. We can thus
use procedure Greedy_Create_Color to recolor these separate
clusters in parallel without collisions. (The leader of each
cluster does all the computation for its cluster.)
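
As a rough illustration of the graph power used here, the following centralized sketch computes the adjacency of $G^t$ by a depth-limited BFS from every node. The function name `graph_power` and the adjacency-dictionary input are assumptions made for this example; in the distributed setting each node would instead explore paths of length $t$ itself, which is the factor-$t$ slowdown noted above.

```python
from collections import deque


def graph_power(adj, t):
    """Adjacency of G^t: u and v are adjacent iff 0 < dist_G(u, v) <= t."""
    power = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == t:
                continue  # paths longer than t do not create edges
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        power[source] = set(dist) - {source}
    return power


# Hypothetical example graph: the path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
squared = graph_power(path, 2)
```

Two clusters that are non-adjacent in `squared` are more than 2 apart in the path, which is the sense in which separation 1 on $G^t$ becomes separation $t$ on $G$.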

The recursive algorithm has two parts:

1. Find a $(\chi, dt, t)$-decomposition, where
$\chi = xkn^{1/k}$ and $d, t = 2k$, on each of $x$ disjoint
subgraphs.

2. Merge these together by recoloring, as just described, to get
a $(kn^{1/k}, 2k, 1)$-decomposition.

Algorithm: Color(G)

Input: graph $G = (V, E)$, $|V| = n$, and integer $k \ge 1$.

Output: A $(kn^{1/k}, 2k, 1)$-decomposition of $G$.

1.  Compute 

2. If $G$ has fewer than $x$ nodes, run the Linial-Saks [16] or
Awerbuch-Peleg [14] simple greedy algorithm on $G^{2k}$, and go
to step 6.

3. Partition nodes of $G$ into $x$ subsets, $V_1, \ldots, V_x$
(based on the last $\log x$ bits of node IDs, which are then
discarded).

4. Define $G_i$ to be the subgraph of $G^{2k}$ induced on $V_i$.

5. In parallel, for $i = 1, \ldots, x$, Color($G_i$).
(every node of $G$ is now colored recursively)

6. For each $v \in V$, color $v$ with the color
$\langle i, color(v) \in G_i \rangle$.
(this gives an $xkn^{1/k}$ coloring of $G$ with separation $2k$)

7. Do sequentially, for $i = 1$ to $kn^{1/k}$,
Create_New_Color(G, i)
(this gives a $kn^{1/k}$ coloring of $G$ with separation 1)

Algorithm: Create_New_Color(G, i)

(this colors a constant fraction of the old-colored nodes
remaining with new color $i$)

Input: graph $G$ with new and old colored nodes such that there
is a $(xkn^{1/k}, (2k)^2, 2k)$-decomposition on the old-colored
nodes of $G$ and a $(i - 1, 2k, 1)$-decomposition on the
new-colored nodes of $G$

Output: graph $G$ with new and old colored nodes such that there
is a $(xkn^{1/k}, (2k)^2, 2k)$-decomposition on the old-colored
nodes of $G$ and a $(i, 2k, 1)$-decomposition on the new-colored
nodes of $G$

2. Do sequentially, for $j = 1$ to $xkn^{1/k}$,
"Look at nodes with old color $j$":

(a) Do in parallel for color $j$ clusters,

• Elect a leader for each cluster.

• The leader learns the identities, $U$, of all the nodes in $W$
within $k$ distance from the border of its cluster (i.e. this is
graph $G$ for that cluster).

• The leader calls procedure Greedy_Create_Color(R, U), where
$R$ is the set of old-colored $j$ nodes in both the leader's
cluster and in $W$.

• Greedy_Create_Color returns $(DR, DU)$. The leader colors the
nodes in $DR$ with new color $i$, and sets
$W \leftarrow W - DU$.

Greedy_Create_Color is the procedure of the Awerbuch-Peleg [14]
greedy algorithm that determines what nodes will be given the
current new color. The algorithm identifies a constant fraction
of the nodes in the cluster $R$ to be colored. The algorithm
picks an arbitrary node in $R$ (call it a center node) and
greedily grows a ball around it of minimum radius $r$, such that
a constant fraction of the nodes in the ball lie in the interior
(i.e. are in the ball of radius $r - 1$ around the center node).
It is easy to prove that there always
exists  an  r  <  iblii|^/*  for  which  this  condition 
holds. Note that although the centers of the balls grown out are
always picked (arbitrarily) from the nodes in $R$, the interiors
and borders of the balls which are then claimed include any of
the nodes in $V$ (not just those in $R$) within the ball. Then
another arbitrary node is picked, and the same thing is done,
until all nodes in $R$ have been processed. Procedure
Create_New_Color will then color the interiors of the balls (set
$DR$) with new color $i$, and remove each entire ball from the
working graph $W$.

Algorithm: Greedy_Create_Color(R, U)

Input: sets of nodes $R$ and $U$, where $R$ is the set of nodes
in the cluster and $U$ is a superset of nodes that contains $R$.

Output: $(DR, DU)$. This returns a constant fraction of the
nodes in $R$ in set $DR$ and the 1-neighborhoods of the clusters
of $DR$ in set $DU$.

1. $DR \leftarrow \emptyset$; $DU \leftarrow \emptyset$.

2. While $R \ne \emptyset$ do

$(2k)^2$ is the number of steps per iteration. Overall, we have

$T(n) \le 2kT(n/x) + x(kn^{1/k})^2(2k)^2$

$\le (2k)^{\log n/\log x} x(kn^{1/k})^2(2k)^2$

$\le n^{2\sqrt{1+\log k}/\sqrt{\log n} + 2/k}(2k)^2$,

when $x = 2^{\sqrt{\log n}\sqrt{1+\log k}}$. □
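
The choice of $x$ balances the recursion overhead against the merge cost. The sketch below is a numerical sanity check only, under two assumptions made here: logarithms are base 2, and $x = 2^{\sqrt{\log n}\sqrt{1+\log k}}$ as this proof is read. With that choice the accumulated factor $(2k)^{\log n/\log x}$ equals $x$ itself. The function names are illustrative and not from the paper.

```python
import math


def branching_factor(n, k):
    """x = 2 ** (sqrt(log2 n) * sqrt(1 + log2 k)); assumed reading of the proof."""
    return 2 ** (math.sqrt(math.log2(n)) * math.sqrt(1 + math.log2(k)))


def recursion_overhead(n, k):
    """(2k) ** (log n / log x): the factor 2k accumulated over the recursion depth."""
    x = branching_factor(n, k)
    return (2 * k) ** (math.log2(n) / math.log2(x))


n, k = 2 ** 16, 4
x = branching_factor(n, k)
overhead = recursion_overhead(n, k)
```

Since `overhead` matches `x`, the product of the two is $x^2 = n^{2\sqrt{1+\log k}/\sqrt{\log n}}$, which is where the first term of the exponent in the bound comes from.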

Theorem 3.2 There is a deterministic distributed asynchronous
algorithm which, given a graph $G = (V, E)$, finds a
$(kn^{1/k}, 2k, 1)$-decomposition of $G$ in

$n^{2\sqrt{1+\log k}/\sqrt{\log n} + 2/k}(2k)^2$ time.

Corollary 3.3 There is a deterministic distributed asynchronous
algorithm which, given $G = (V, E)$, finds an
$(O(\log n), O(\log n), 1)$-decomposition of $G$ in
$n^{O(\sqrt{\log\log n}/\sqrt{\log n})}$ time, which is
$n^{\epsilon}$ for any $\epsilon > 0$. We remark that the
constant on the big-oh in the running time is 3.

As  a  corollary  to  our  theorem  and  [17],  we 
can  obtain  a  slightly  better  running  time. 

Corollary  3.4  There  is  a  deterministic  dis¬ 
tributed  asynchronous  algorithm  which  given 
G  =  {V,E),  finds  a  (0(logn),0(logn),  1)- 
decomposition  of  G  in  0(n*^v^*°*")  time,  which 
is $n^{\epsilon}$ for any $\epsilon > 0$.


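(a) $S \leftarrow \{v\}$ for some $v \in R$.

(b) While $|N_1(S) \cap U| >$ do

$S \leftarrow S \cup (N_1(S) \cap U)$.

(c) $DR \leftarrow DR \cup S$.

(d) $DU \leftarrow DU \cup (N_1(S) \cap U)$.

(e) $R \leftarrow R - S - (N_1(S) \cap R)$.

(f) $U \leftarrow U - S$.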

Lemma 3.1 If $x = 2^{\sqrt{\log n}\sqrt{1+\log k}}$, the
running time of the procedure Color is

$n^{2\sqrt{1+\log k}/\sqrt{\log n} + 2/k}(2k)^2$.

Proof The branching phase of the recursion takes time
$T'(n) \le 2kT'(n/x) + x$. The merge takes time
$x(kn^{1/k})^2(2k)^2$, where $x(kn^{1/k})^2$ is the number of
iterations overall and


4  Strong  Diameter  Network 
Decomposition 

The  algorithm  in  the  previous  section  pro¬ 
duced  a  weak-diameter  network  decomposi¬ 
tion.  While  this  is  a  nice  problem,  the  strong- 
diameter  form  is  the  one  we  want  in  order 
to  successfully  run  most  distributed  applica¬ 
tions.  In  this  section,  we  give  a  reduction 
that  given  a  weak  diameter  decomposition, 
constructs  a  structure  that  is  simultaneously 
both  a  strong  diameter  decomposition  and  a 
sparse  neighborhood  cover.  The  algorithm 
as  written  outputs  the  cover;  the  associated 
strong  decomposition  consists  of  the  interi¬ 
ors  of  the  clusters  in  the  sparse  neighborhood 
cover. 

We introduce an algorithm Sparse, which takes as input a
procedure Decomp, which given a graph $G = (V, E)$, finds a
$(kn^{1/k}, 2k, 1)$-decomposition of $G$. In actuality, we will
bind Decomp to procedure Color
of  Section  3.  Sparse  first  calls  procedure 
Decomp  with  G***.  Of  course,  this  will  yield 
an  0{kt)  blowup  in  the  running  time  of 
Decomp, say  r. 

Once  Decomp  is  called,  the  remaining  run¬ 
ning  time  for  Sparse  is  times 

a  t  blowup  for  traversing  ^-neighborhoods. 
Then,  in  sum,  Sparse  is  able  to  obtain  a  t- 
neighborhood  cover  in  the  original  graph  G 
in  time  0{ktT  +  k^tn?^^).  Recall  that  k  is 
typically  logn. 

Notice that the code for Sparse is similar to the last pass of
procedure Color (Section 3); however, Sparse has an additional
level of complexity. To obtain a $t$-neighborhood cover, we must
modify the Awerbuch-Peleg [13] coarsening algorithm, called as a
subroutine, so that we can recolor clusters in parallel without
interference.

Notation. In the algorithms below, we use roman capital letters
for names of sets, and calligraphic letters for names of
collections of sets. In particular, corresponding to a set $W$,
by convention we will denote by $\mathcal{W}$ the collection
consisting of the sets $\{N_t(v) \mid v \in W\}$.

Algorithm: Sparse(G, Decomp)

Input: graph $G = (V, E)$, $|V| = n$, and integer $k \ge 1$, and
a procedure Decomp that finds a
$(kn^{1/k}, 2k, 1)$-decomposition of $G$.
Output: $\mathcal{T}$, a sparse $(k, t)$-neighborhood cover of
$G$.

1.  Decomp(G*“). 

{returns  a  {kn^!^ ,2k,\)-decomposition 
of  which  is  a  {kn^/’‘ ,\%kH,%kt)- 
decomposition  of  G.) 

(a) $\mathcal{T} \leftarrow \emptyset$.

($\mathcal{T}$ is the cover.)

(b) Do sequentially, for $i = 1$ to $kn^{1/k}$,
(find a $kn^{1/k}$-degree $t$-neighborhood cover of $G$.)

i. $\mathcal{U} \leftarrow \{N_t(v) \mid v \in V\}$.

($\mathcal{U}$ is the collection of all unprocessed
$t$-neighborhoods.)

ii. Do sequentially, for $j = 1$ to $kn^{1/k}$,

"Look at nodes with old color $j$":

A. Do in parallel for color $j$ clusters,

• Elect a leader for each cluster.

• The leader learns the identities of all the $t$-neighborhoods
of nodes within a $4kt$ distance from the border of its cluster.

• The leader calls procedure Cover($\mathcal{R}, \mathcal{U}$)
on $G$, where $\mathcal{R}$ is the collection of
$t$-neighborhoods of old-colored $j$ nodes in both the leader's
cluster and in $\mathcal{U}$.

• Cover returns $(\mathcal{DR}, \mathcal{DU})$. The leader
colors the nodes in $\mathcal{DR}$ with new color $i$, and sets
$\mathcal{U} \leftarrow \mathcal{U} - \mathcal{DU}$.

iii.  T  4-  r  U  x>7e 

Cover  is  our  modification  of  the  Awerbuch- 
Peleg  [13]  coarsening  algorithm  that  deter¬ 
mines  what  nodes  will  be  given  the  current 
new  color.  The  actual  code  for  this  proce¬ 
dure  follows  a  description  of  the  algorithm 
below.  The  key  to  our  fast  simulation  of  their 
coarsening  algorithm,  is  that  we  keep  track 
of  neighborhoods  within  and  outside  of  the 
old-colored  j  clusters  separately,  in  order  to 
recolor  clusters  in  parallel  without  collisions. 

Procedure Cover($\mathcal{R}, \mathcal{U}$) operates in
iterations. Each iteration constructs one output cluster
$Y \in \mathcal{DT}$ by merging together some clusters of
$\mathcal{U}$. The iteration begins by arbitrarily picking a
cluster $S$ in $\mathcal{U} \cap \mathcal{R}$ and designating it
as the kernel of a cluster to be constructed next. The cluster
is then repeatedly merged with intersecting clusters from
$\mathcal{U}$. This is done in a layered fashion, adding one
layer at a time. At each stage, the original cluster is viewed
as the internal kernel $Y$ of the resulting cluster $Z$. The
merging process is carried out repeatedly until reaching a
certain sparsity condition (specifically, until the next
iteration increases the number of clusters merged into $Z$ by a
factor of less than
The  procedure  then  adds  the  kernel 
$Y$ of the resulting cluster $Z$ to a collection $\mathcal{DT}$.
It is important to note that the newly formed cluster consists
of only the kernel $Y$, and not the entire cluster $Z$, which
contains an additional "external layer" of $\mathcal{R}$
clusters. The role of this external layer is to act as a
"protective barrier" shielding the generated cluster $Y$, and
providing the desired disjointness between the different
clusters $Y$ added to $\mathcal{DT}$.

Throughout the process, the procedure also keeps the "unmerged"
collections $\mathcal{Y}, \mathcal{Z}$ containing the original
$\mathcal{R}$ clusters merged into $Y$ and $Z$. At the end of
the iterative process, when $Y$ is completed, every cluster in
the collection $\mathcal{Y}$ is added to $\mathcal{DR}$, and
every cluster in the collection $\mathcal{Z}$ is removed from
$\mathcal{U}$. Then a new iteration is started. These iterations
proceed until $\mathcal{U} \cap \mathcal{R}$ is exhausted. The
procedure then outputs the sets $\mathcal{DR}$ and
$\mathcal{DT}$.

Procedure  Cover  is  formally  described  in 
Figure  1.  Its  properties  are  summarized  by 
the  following  lemma.  We  comment  that  our 
modifications  do  not  change  the  lemma. 

Lemma 4.1 ([13]) Given a graph $G = (V, E)$, $|V| = n$, a
collection of clusters $\mathcal{R}$ and an integer $k$, the
collections $\mathcal{DT}$ and $\mathcal{DR}$ constructed by
Procedure Cover($\mathcal{R}, \mathcal{U}$) satisfy the
following properties:

(1) All clusters in $\mathcal{DR}$ have their $t$-neighborhood
contained in some cluster in $\mathcal{DT}$.

(2) $Y \cap Y' = \emptyset$ for every pair of distinct
$Y, Y' \in \mathcal{DT}$.

(3)  \vn\  >  and 

(4) $\max_{T \in \mathcal{DT}} Diam(T) \le (2k - 1) \max_{R \in \mathcal{R}} Diam(R)$.


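$\mathcal{DT} \leftarrow \emptyset$ ; $\mathcal{DR} \leftarrow \emptyset$
repeat
    Select an arbitrary cluster $S \in \mathcal{U} \cap \mathcal{R}$.
    $\mathcal{Z} \leftarrow \{S\}$
    repeat
        $\mathcal{Y} \leftarrow \mathcal{Z}$
        $Y \leftarrow \bigcup_{S \in \mathcal{Y}} S$
        $\mathcal{Z} \leftarrow \{S \mid S \in \mathcal{U}, S \cap Y \ne \emptyset\}$.
    until $|\mathcal{Z}| \le$
    $\mathcal{U} \leftarrow \mathcal{U} - \mathcal{Z}$
    $\mathcal{DT} \leftarrow \mathcal{DT} \cup \{Y\}$
until $\mathcal{U} \cap \mathcal{R} = \emptyset$
Output $(\mathcal{DR}, \mathcal{DT})$.

Figure 1: Procedure Cover($\mathcal{R}, \mathcal{U}$).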
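
A centralized sketch of the layered merging in Figure 1 may help in following the iterations. It rests on several assumptions introduced here and not taken from the figure: the stopping threshold is taken to be $n^{1/k}|\mathcal{Y}|$ (the threshold is not legible in the figure), clusters are Python frozensets, the "arbitrary" kernel is chosen deterministically, and the update of $\mathcal{DR}$ follows the prose description rather than the figure.

```python
def cover(R, U, n, k):
    """Sketch of Cover(R, U). R and U are collections of frozenset clusters.

    Returns (DR, DT): the original clusters absorbed into kernels, and the
    merged kernels themselves.
    """
    R, U = set(R), set(U)
    DR, DT = set(), set()
    threshold = n ** (1.0 / k)  # ASSUMPTION: growth factor n^(1/k)
    while U & R:
        S = min(U & R, key=sorted)  # arbitrary pick, made deterministic
        Z = {S}
        while True:
            Y_coll = Z                            # unmerged kernel clusters
            Y = frozenset().union(*Y_coll)        # merged kernel
            Z = {c for c in U if c & Y}           # kernel plus external layer
            if len(Z) <= threshold * len(Y_coll):
                break
        U -= Z          # the whole ball, including its external layer, is used up
        DT.add(Y)       # only the kernel becomes an output cluster
        DR |= Y_coll    # per the prose: kernel clusters are added to DR
    return DR, DT


# Hypothetical input: the 1-neighborhoods of the path 0 - 1 - 2 - 3 - 4.
nbrs = [frozenset(s) for s in ({0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3, 4}, {3, 4})]
DR, DT = cover(nbrs, nbrs, n=5, k=1)
```

With `k=1` the threshold is so large that every kernel stops after one layer, which makes the disjointness of the output kernels easy to see by hand: the external layer removed from `U` keeps later kernels from touching earlier ones.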


Theorem 4.2 There is a deterministic distributed algorithm,
e.g. Sparse(G, Color), that given a graph $G = (V, E)$,
$|V| = n$, and integers $k, t \ge 1$, constructs a
$t$-neighborhood

cover  of  G  in  time  in  the 

asynchronous model, where each node is in at most $O(kn^{1/k})$
clusters, and the maximum cluster diameter is $O(kt)$.

Finally,  we  remark  that  if  we  color  only 
the  interiors  of  the  new  color  i  clusters,  the 
above  construction  produces  a  strong  diame¬ 
ter  high-quality  network  decomposition  from 
a  weak  diameter  high-quality  network  decom¬ 
position,  as  well  as  a  sparse  neighborhood 
cover.  This  is  because  our  construction  is 
such  that  each  node  in  the  cover  lies  in  pre¬ 
cisely  one  new-colored  interior. 

Abstract: This paper presents protocols for leader election in
complete networks. The protocols are message optimal and their
time complexities are a significant improvement over currently
known protocols for this problem. For asynchronous complete
networks with sense of direction, we propose a protocol which
requires $O(N)$ messages and $O(\log N)$ time. For asynchronous
complete networks without sense of direction, we show that
$\Omega(N/\log N)$ is a lower bound on the time complexity of
any message optimal election protocol and we present a family of
protocols which requires $O(Nk)$ messages and $O(N/k)$ time,
$\log N \le k \le N$. Our results also improve the time
complexity of several other related problems such as spanning
tree construction, computing a global function, etc.

1  Introduction 

In the leader election problem, there are $N$ processors in the
network, each having a unique identity. Initially all nodes are
passive. An arbitrary subset of nodes, called the base nodes,
wake up spontaneously and start the protocol. On the termination
of the protocol, exactly one node announces itself the leader.
In this paper, we consider the problem of electing a leader in
an asynchronous complete network. In a complete network, each
pair of nodes is connected by a bidirectional link and we assume
that a node is initially unaware of the identity of any other
node.

Leader election is a fundamental problem in distributed
computing and has been studied in various computation models.
For complete networks in which a node is unable to distinguish
between its incident links, [KMZ84] showed that
$\Omega(N \log N)$ messages are required for electing a leader.
However, [LMW86] showed that the lower bound of
$\Omega(N \log N)$ messages does not hold for complete networks
with sense of direction and gave a protocol which requires
$O(N)$ messages. A network has a sense of direction if there
exists a directed Hamiltonian cycle and each edge incident at
any node $i$ is labeled with the distance of the node at the
other end along this Hamiltonian cycle. Figure 1 shows a
complete network containing six nodes with a sense of direction.
[ALSZ89] further showed that $O(\log N)$ chords in a ring
network are sufficient to obtain a protocol with $O(N)$ message
complexity. These two extreme cases, one in which a node is
unable to distinguish between any two incident edges and the
other in which all edges are labeled with a distinct number,
show the impact of knowledge of topological information on the
complexity of leader election.

The time complexity of the protocol in [LMW86] is $O(N)$. In
this protocol, a node captures a majority of nodes before it
declares itself as the leader. We observe that in such networks,
a node does not have to capture a majority of nodes in order to
be elected the leader. We use this idea to obtain a simple
protocol which requires $O(N)$ messages. However, due to
congestion on the links and a specific wake up pattern of nodes,
its time complexity is not $O(\log N)$. We then modify this
protocol to solve these problems and obtain a protocol which
requires $O(N)$ messages and $O(\log N)$ time.

We also propose an improved protocol for leader election in
asynchronous complete networks without sense of direction (in
the rest of the paper, unless otherwise stated, we will use
'complete network' to mean 'complete network without sense of
direction'). [KMZ84] proposed a protocol for this problem which
requires $O(N \log N)$ messages and $O(N \log N)$ time. [AG85]
gave a series of simple message optimal protocols for complete
networks, each with $O(N)$ time complexity. Furthermore, it was
conjectured in [AG85] that $\Omega(N)$ is a lower bound on the
time complexity of any message optimal election protocol for
asynchronous complete networks. We prove that
$\Omega(N/\log N)$ is a lower bound on the time complexity of
any message optimal protocol for this problem. This proves that
introducing asynchrony may result in a loss in speed by a factor
of $N/(\log N)^2$. A similar result was shown in [AFL83] where a
particular asynchronous system was shown to be slower by a
factor of $\log N$ than the corresponding synchronous system. We
also provide a message optimal protocol for asynchronous
complete networks which requires $O(N/\log N)$ time. The
protocol involves a new technique which allows us to distinguish
between nodes that wake up at different times to participate in
the protocol. The complexity of the protocol depends on the
number of base nodes and we show that the time complexity can be
improved to $O(\log N + \min(r, N/\log N))$, where $r$ is the
number of base nodes. We also present a protocol tolerant to $f$
initial site failures which requires $O(Nf + N \log N)$ messages
and $O(N/\log N)$ time, where $f < N/2$.


Figure 1: A complete network with a sense of direction.
(Legend: the directed Hamiltonian cycle.)


There are many problems, such as spanning tree construction and
computing a global function, which are equivalent to leader
election in terms of message and time complexities. Our
protocols, therefore, lead to improvements in the time
complexity of these problems as well.

This  paper  is  organized  as  follows.  In  the  next 
section,  we  present  our  model  of  distributed  com¬ 
putation.  In  Section  3,  we  present  a  protocol  for 
leader  election  in  a  complete  network  with  sense 
of  direction.  In  Section  4,  we  present  a  protocol 
for  leader  election  in  a  complete  network  with¬ 
out  sense  of  direction  and  we  show  a  lower  bound 
on  the  time  complexity  of  any  message  optimal 
leader  election  protocol. 

2 Model

We model the communication network as a complete graph
$(N, E)$, where $N$ and $E$ represent the processors and
communication links respectively. We assume that each node has a
unique identity. Messages sent over a link arrive at their
destination within finite but unpredictable time and in the
order sent and are not lost. Each message may carry $O(\log N)$
bits of information.
The  message  complexity  of  a  protocol  is  the  max¬ 
imum  number  of  messages  sent  during  any  pos¬ 
sible  execution  of  the  protocol.  The  time  com¬ 
plexity  of  a  protocol  is  the  worst  case  execution 
time  assuming  that  each  message  takes  at  most 
one  time  unit  to  reach  its  destination  and  com¬ 
putation  time  is  negligible.  Furthermore,  inter¬ 
message  delay  on  a  link  is  at  most  one  time  unit. 
All  additions  in  the  paper  are  assumed  modulo 
N. 

3  Complete  Networks  with 
sense  of  direction 

For complete networks with sense of direction, [LMW86] proposed
a leader election protocol which requires $O(N)$ messages and
$O(N)$ time. Let $i[d]$ denote the node at distance $d$ from $i$
and $i[x..y]$ denote the set $\{i[x], i[x+1], \ldots, i[y]\}$.
In [LMW86], if node $i$ is able to capture nodes in
$i[1..N/2]$ then it can declare itself as the leader. We observe
that in the presence of a sense of direction, a node does not
have to capture a majority of nodes. For example, if $i$
captures all nodes in $\{i[1..N/4], i[N/2], i[3N/4]\}$ then it
can declare itself as the leader. By capturing $i[N/2]$, for
example, $i$ ensures that no node in $i[N/4+1..N/2]$ will be
able to become a leader since a node in this set must capture
$i[N/2]$ to become a leader. In particular, a node can declare
itself as the leader after capturing the nodes in the set
$\{i[1..k], i[2k], i[3k], \ldots, i[N-k]\}$. We combine this
idea with those in [LMW86] to obtain a new protocol, A, which is
as follows:

Protocol A: This protocol proceeds in two phases:

First Phase: On waking up spontaneously, a base node $i$ tries
to capture nodes in $S_i = i[1..k]$ in a sequential fashion. A
passive node wakes up on receiving a message of the protocol. A
passive node is not allowed to become a base node if it wakes up
on receiving a message of the protocol. A base node $i$ uses its
identity and $level_i$, which denotes the number of nodes which
$i$ has captured so far, to contest with other nodes. When a
base node $i$ wakes up, it sends a message
$capture(i, level_i)$ to $i[1]$. When a node $j$ receives a
$capture(i, l)$ message, it behaves as follows:

• If $j$ is not a base node or it has been captured then it
responds with an $accept(0)$ message.

• If $j$ is a base node which has not yet been captured, and
$(level_j, j) < (l, i)$ then again, $i$ captures $j$ and $j$
responds with $accept(level_j)$. Otherwise, $j$ ignores the
message.

If $i$ receives $accept(l)$, it adds $l + 1$ to $level_i$ (and
therefore the set of captured nodes is extended to include the
nodes captured by $j$). If $level_i < k$ then it continues its
conquest by sending a capture message to $i[level_i + 1]$.
Otherwise, it enters the second phase.
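
The first-phase contest can be summarized as a pair of pure functions. This is a minimal sketch of the decision rule only, with no message passing; the function names and the flag arguments `j_is_base` and `j_captured` are hypothetical and introduced for illustration.

```python
def respond_to_capture(j_is_base, j_captured, level_j, j, l, i):
    """Reply of node j to capture(i, l).

    Returns the level carried by accept(...), or None if j ignores the message.
    """
    if not j_is_base or j_captured:
        return 0                      # accept(0)
    if (level_j, j) < (l, i):         # lexicographic comparison of (level, identity)
        return level_j                # accept(level_j): j surrenders its captures
    return None                       # j is stronger and stays silent


def on_accept(level_i, l):
    """New level of i after receiving accept(l): j itself plus j's l captured nodes."""
    return level_i + l + 1
```

Python compares tuples lexicographically, so `(level_j, j) < (l, i)` orders contestants by level first and breaks ties by identity, as in the rule above.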

Second Phase: On entering this phase, $i$ sets $owner_i$ to $i$
and sends a message, $owner(i)$, to each node $j$ in
$i[1..k]$. On receiving this message, $j$ sets
$owner\text{-}link_j$ to denote the link from $j$ to $i$ and
$owner_j$ to $i$. Furthermore, it sends an acknowledgement
message to $i$. After receiving an acknowledgement from all
nodes in $i[1..k]$, $i$ sends an $elect(i)$ message to each node
in $\{i[2k], \ldots, i[N-k]\}$. On receiving $elect(i)$, site
$j$ behaves as follows: If ($owner_j$ has not been set) or
($owner_j$ has been set and $owner_j < i$) then it sets
$owner_j$ to $i$ and sends an accept message to $i$. Otherwise,
it ignores the message. If $i$ receives all accept responses
then it declares itself the leader.
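
The reason a node need not capture a majority is that the capture sets of any two nodes interact. The sketch below checks this exhaustively for small rings, under the assumption (made here for simplicity) that $k$ divides $N$; the function names are illustrative only.

```python
def capture_set(i, n, k):
    """Nodes i must capture: i[1..k] together with i[2k], i[3k], ..., i[N-k]."""
    near = {(i + d) % n for d in range(1, k + 1)}
    far = {(i + m * k) % n for m in range(2, n // k)}
    return near | far


def conflict(i, j, n, k):
    """True if one of i, j lies in the other's capture set or the sets share a node."""
    si, sj = capture_set(i, n, k), capture_set(j, n, k)
    return j in si or i in sj or bool(si & sj)


def all_pairs_conflict(n, k):
    """Check the interaction property for every pair of distinct nodes."""
    return all(conflict(i, j, n, k)
               for i in range(n) for j in range(n) if i != j)
```

For example, with $N = 16$ and $k = 4$ each node contacts only 6 of the other 15 nodes, yet every pair of would-be leaders still meets at some node.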

In the first phase, each node is captured at most once and each
capturing requires one capture message and one accept message.
Hence, the first phase will require $O(N)$ messages. In the
second phase, there can be at most $O(N/k)$ candidates. Each
candidate sends messages to capture $N/k$ nodes. Hence, the
total number of messages in the second phase is $O(N^2/k^2)$.
The message complexity of A is therefore $O(N + N^2/k^2)$. In
particular, for $k \ge \sqrt{N}$, the protocol requires $O(N)$
messages. We will now compute the time complexity of A. The
execution of the second phase takes $O(1)$ time. Furthermore, if
a node is successful in capturing another node then it does so
in a constant amount of time. Hence, the node which is elected
the leader will finish its first phase within $O(k)$ time units
of waking up. However, the following situation can arise: Assume
that nodes have identities $1, \ldots, N$ such that
$i[1] = i + 1$. Let 1 be the first node to wake up. After waking
up spontaneously, node $i$ sends a capture message to node
$i + 1$. Node $i + 1$ wakes up just before the message from $i$
reaches it and sends the message to capture $i + 2$ before
receiving the message from $i$. In this case, no response will
be sent to $i$ since $i$ has the same level number as $i + 1$
but a smaller identity. If this happens for all sites $i$,
$1 \le i < N$, then only node $N$ will survive and capture all
other nodes. If the capture message for each node takes exactly
one time unit to arrive, node $N$ will wake up at time $N - 1$
and therefore the protocol will require $O(N)$ time units.
However, if all nodes wake up within $O(k)$ time of each other
then the first phase will take $O(k)$ time. We will now modify
the protocol A to obtain A' as follows: After a node $i$ wakes
up (either spontaneously or on receiving a message), it sends a
message to awaken $i[1]$ and $i[k]$. Hence, within
$O(k + N/k)$ time, all nodes will wake up to participate in the
protocol. Therefore, the time complexity of the protocol is
$O(k + N/k)$. In particular, for $k = \sqrt{N}$, the time
complexity of A' is $O(\sqrt{N})$.

We will now extend this idea to obtain a protocol
which requires O(log N) time. Consider the
following protocol, B, which is an asynchronous
version of the synchronous protocol in [AG85].
For simplicity, assume that N is a power of 2.
In this protocol, a candidate node i tries to capture
all other nodes in log N steps. In the first
step, i sends a message to capture i[N/2]. In the
l-th step, i sends messages to capture 2^{l-1} nodes
in the set i[N/2^l], i[3N/2^l], ..., i[(2^l - 1)N/2^l]. If
i[N/2] is also a base node then it will send a message
to capture i in its first step. Hence, only
one of i and i[N/2] will proceed to step 2. Similarly,
only one of i, i[N/4], i[N/2] and i[3N/4]
will proceed to step 3 and so on. Although the
time complexity of this protocol is O(log N), its
message complexity is O(N log N).
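
The capture pattern of B can be made concrete with a short sketch.
It assumes, for illustration only, that nodes are numbered
0, ..., N-1, that i[d] denotes node (i + d) mod N, and that N is a
power of 2; the function name capture_targets is hypothetical.

```python
def capture_targets(i: int, n: int, step: int) -> list[int]:
    """Nodes that candidate i tries to capture in the given step (1-based).

    Assumptions (illustrative): n is a power of 2 and i[d] = (i + d) mod n.
    In step l the targets are i[N/2^l], i[3N/2^l], ..., i[(2^l - 1)N/2^l].
    """
    stride = n // (2 ** step)
    # Odd multiples of the stride: 1, 3, ..., 2^step - 1.
    return [(i + odd * stride) % n for odd in range(1, 2 ** step, 2)]
```

For N = 8 and i = 0 the three steps address nodes [4], [2, 6] and
[1, 3, 5, 7], so a surviving candidate reaches all N - 1 other nodes
after log N steps.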

We will now combine ideas in A and B to obtain
a protocol C which has O(N) message complexity
and O(log N) time complexity. C proceeds
in two phases. In the first phase, we use
A to first reduce the number of candidates to at
most N/log N. The second phase employs B to
elect  the  leader.  Let  k  = 

First Phase: In this phase, i tries to capture
i[k], i[2k], ..., i[N - k] in a sequential manner.
Observe that when i[xk] wakes up, it will try
to capture the same set of nodes in the order
i[xk + k], ..., i[xk + N - k]. Hence, in the first
phase, nodes in this set compete against each
other. The rules for capturing are the same as
in the first phase of A. Hence, for example, if
i[xk] has already captured i[(x + 1)k] when i
captures i[xk], then i[xk] surrenders i[(x + 1)k]
to i and therefore, i can extend its set of captured
nodes to include this node also. Using
i as the reference node, the nodes can be partitioned
into k sets, R_0, ..., R_{k-1}, where R_j =
{i[j + k], i[j + 2k], ..., i[j + N - k]}. After the first
phase, we have at most one alive candidate from
each of these sets.

Second Phase: On entering this phase, node i
sends messages to each node j in R_i to update
owner_j to i. In this phase, we have at most one
candidate in each set R_j, and we have to elect
a leader among them. Let i be the candidate in
R_0. In order to defeat other candidates, i tries
to capture the nodes in the set i[1..k-1] in log k
steps. This phase is an asynchronous version of
the synchronous protocol in [AG85]. In the first
step, i sends an elect(i, 0) message to capture
i[k/2]. In the l-th step, it sends 2^{l-1} messages to
capture i[k/2^l], i[3k/2^l], ..., i[(2^l - 1)k/2^l]. Observe
that if there is an alive candidate in R_{k/2}
then it will send a message to capture a node in
R_0 in its first step. Hence, only one node from
R_0 and R_{k/2} will go to step 2. In step 2, since a
node sends messages to capture nodes at distance
k/4 and 3k/4, only one node from R_0, R_{k/4}, R_{k/2}
and R_{3k/4} will survive. In general, after the l-th
step, there will be at most k/2^l alive candidates
(note that k is O(N/log N)). In this phase, to
contest with other nodes, i uses its identity and
step_i, which indicates the number of steps which
i has executed so far. Consider the case in which
a node, i, in R_x sends a message to capture a
node, j, in R_y. If j is passive then there is no
base node in R_y which has captured all nodes in
R_y and therefore, j sends an accept message to
i. If there is a candidate in R_y then the message
is forwarded to that node (owner-link_j will be
the edge leading to this node) and they compete
on the basis of (step, id). However, if i's message
reaches the candidate in R_y and finds that
this node has already been captured then i must
first kill the owner of R_y's candidate before it
can claim R_y's candidate. For this purpose, the
message is forwarded to that node (thus, each
message can be forwarded at most twice). For
example, if i in R_{k/2} sends a message to capture
j in R_{3k/4} and the base node in R_{3k/4} has
already been captured by R_{k/4} in step 1 then i
must defeat the base node in R_{k/4} before claiming
R_{3k/4}.

The first phase requires O(log N) time since a
node competes only with O(log N) other nodes.
The second phase involves O(log N) steps, each
of which will take a constant amount of time.
Hence, the protocol requires O(log N) time. We
will now compute the message complexity of
C. The first phase requires O(N) messages
since a node is captured at most once. In the
second phase, there can be at most k candidates.
Furthermore, there can be at most k/2^{l-1}
nodes in step l (since a node in step l must
have captured 2^{l-1} nodes and sets of nodes captured
by different sites are disjoint). A node
in step l sends 2^{l-1} messages to capture nodes.
Each of these messages generates a constant
number of messages. Since k = O(N/log N),
the total number of messages generated in the
second phase is

Σ_{1 ≤ l ≤ log N} (O(k/2^{l-1}) * O(2^{l-1}))
= Σ_{1 ≤ l ≤ log N} (O(N/(log N * 2^{l-1})) * O(2^{l-1}))
≤ O(log N * N/log N) = O(N).

Hence, the message complexity of the protocol is O(N).
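
As a sanity check on the sum for the second phase, the following
sketch evaluates it numerically with all hidden constants set to 1.
The choice k = N/log N (rounded down), the base-2 logarithm and the
function name second_phase_bound are assumptions made for the
illustration; the sketch only confirms the arithmetic and does not
simulate the protocol.

```python
import math


def second_phase_bound(n: int) -> int:
    """Evaluate sum over l = 1..log N of (k / 2^(l-1)) * 2^(l-1).

    Assumptions (illustrative): n is a power of 2, logarithms are base 2,
    k = n // log2(n), and every O(.) constant is taken to be 1.
    """
    log_n = int(math.log2(n))
    k = n // log_n
    total = 0
    for l in range(1, log_n + 1):
        candidates = k // 2 ** (l - 1)  # at most k / 2^(l-1) nodes reach step l
        sent_each = 2 ** (l - 1)        # each of them sends 2^(l-1) messages
        total += candidates * sent_each
    return total
```

For N = 2^16 the sum evaluates to 53248, which is below N = 65536,
in line with the O(N) bound.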

4  Complete  Networks  without 
sense  of  direction 

In this section, we will present a family of algorithms
for leader election in complete networks
without sense of direction. Protocols belonging
to this family require O(Nk) messages and
O(N/k) time, where log N ≤ k ≤ N [Si91]. We
will first present two different algorithms, D and
E, for leader election. We will then combine features
of these algorithms to obtain the final protocol
F.

Protocol D: In this algorithm, a base node attempts
to capture all other nodes in parallel. On
waking up spontaneously, a base node sends its
identity in an elect message on all incident edges.
When a node j receives an elect(i) message over
edge e, it behaves as follows: If j is a base node
and j > i then no response is sent over e; otherwise,
j sends an accept message over e. A node
that receives an accept message on all incident
edges declares itself the leader. The time complexity
of this protocol is O(1). However, its
message complexity is O(N^2) since the number
of base nodes may be O(N), each of which will
send O(N) messages.
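
The behavior of D can be sketched as a small simulation. This is
an illustration under stated assumptions rather than the paper's
implementation: identities are 0, ..., N-1, all base nodes wake up
at the same time, at least one base node exists, and the name
run_protocol_d is hypothetical.

```python
def run_protocol_d(n: int, base_nodes: set[int]) -> tuple[int, int]:
    """Return (leader, number of elect messages) for one run of D.

    Assumptions (illustrative): identities are 0..n-1, base_nodes is
    non-empty, and every base node wakes up spontaneously at time 0.
    """
    accepts = {i: 0 for i in base_nodes}
    elect_messages = 0
    for i in base_nodes:
        for j in range(n):
            if j == i:
                continue
            elect_messages += 1  # elect(i) sent over the edge to j
            # j stays silent only if it is a base node with a larger identity.
            if not (j in base_nodes and j > i):
                accepts[i] += 1
    # A node that collected an accept on every incident edge is the leader.
    leaders = [i for i, count in accepts.items() if count == n - 1]
    if len(leaders) != 1:
        raise ValueError("expected exactly one leader")
    return leaders[0], elect_messages
```

With N = 6 and base nodes {1, 2, 4}, node 4 is elected after
3 * 5 = 15 elect messages, which illustrates why O(N) base nodes
lead to O(N^2) messages.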

Protocol E: E is a modification of the protocol A
in [AG85]. The outline of protocol A in [AG85]
is as follows:

A base node tries to capture other nodes in
a sequential manner by sending capture messages
on its incident edges one at a time. A
node that is successful in capturing all other
nodes is elected the leader. A base node i
sends its identity and a variable, level_i, in the
capture message to contest with other nodes
(level_i is the number of nodes which i has
captured so far). If a capture message from
i reaches a node j which has not yet been
captured, and (level_i, i) > (level_j, j) (lexicographically)
then i captures j, otherwise i is
killed. If j is a captured node, then i has
to kill j's owner before claiming j. If i is
successful in capturing j then it increments
level_i and proceeds with its conquest by sending
a capture message to another node.
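
The lexicographic contest on (level, identity) pairs can be written
in one line. The sketch below is only an illustration of the
comparison rule; the function name capture_succeeds is hypothetical.

```python
def capture_succeeds(level_i: int, i: int, level_j: int, j: int) -> bool:
    """True if a capture message from i defeats the uncaptured node j.

    Python compares tuples lexicographically, so the level is compared
    first and the identity only breaks ties between equal levels.
    """
    return (level_i, i) > (level_j, j)
```

A node with a higher level wins regardless of identities, and
between two nodes at the same level the larger identity wins.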

The message complexity of A is O(N log N).
Although the time complexity of A is O(N), it
does not possess the property that a node is able
to capture another node in a constant amount of
time. For example, a captured node j may receive
capture messages from nodes i_1, i_2, ...
in the order given and forward each of these messages
to its owner. In particular, if it forwards
O(N) messages and only the last forwarded message
is able to defeat owner_j then it may take
O(N) time to capture j (since the messages are
forwarded on the same link and the inter-message
delay on the same link can be 1 time unit, the last
forwarded message may reach owner_j after O(N) time
units). We modify A to obtain E in which there
is at most one forwarded message on a link at any
time. In E, a captured node j uses a boolean
variable forward_j to keep track of whether or
not it has forwarded a message to its owner. If
j receives a capture message and forward_j is
true then it delays forwarding the message to
its owner until it receives a response from its
owner. Each message forwarded to the owner is
answered with an accept or a reject message depending
on whether the forwarded message defeated
the owner. If an accept message is received
then j sends an accept message to the node from
which it has received the largest (level, id) pair
so far. If a reject message is received, j forwards
the message with the largest (level, id) pair it
may have received in the meanwhile. Thus, in E,
if a node is able to capture another node then it
does so in a constant amount of time.

In D, if the number of candidates is restricted
to O(k) then it will require O(Nk) messages. In
protocol E, there can be at most k nodes at level
N/k [AG85]. We obtain a new protocol F in
which E is used to reduce the number of candidates
for protocol D by requiring a node to execute
E until its level number reaches N/k and D
thereafter, where log N ≤ k ≤ N. The protocol
is as follows:

On waking up spontaneously, a node starts
executing protocol E. When a node reaches
level N/k, it sends an elect message with
its identity on all incident edges. Let node
j receive an elect(i) message over e. If
(level_j, maxid_j) is less than (N/k, i) then j
changes status to killed and sends an accept
message over e. A node which receives an
accept message on all incident edges declares
itself the leader.

Since the message complexity of E is
O(N log N) and at most k nodes broadcast an
elect message, the message complexity of F is
O(Nk) (since k ≥ log N). Since it takes a constant
amount of time to capture a node, it will
take O(N/k) time for a node to reach level N/k
after it wakes up (if it reaches this level). After a
node reaches this level, it executes D which takes
O(1) time. Therefore, the node which is elected
the leader will take O(N/k) time after waking up
spontaneously to declare itself the leader. Thus,
we have the following lemma:

Lemma 4.1 If all nodes wake up within O(N/k)
time of each other, then F will terminate in
O(N/k) time.

However, a situation similar to the one in the
first phase of protocol A for networks with sense
of direction (in which i + 1 wakes up just before
the message from i reaches it) can occur, which
may lead to an execution in which the node which
is elected the leader wakes up O(N) time units
after the first node wakes up. Since nodes are
unable to distinguish between the incident edges,
the solution used in the presence of sense of direction
will not work here. However, we also have
the following result: After a node i reaches level
k, only a node at level at least k can capture it.
Hence, in every interval of c time units after a
node reaches level k, where c is a constant, either
the node with the highest level number (which is
greater than k) will increase its level number or
it will be killed by another node with level number
at least k. Since there are at most N/k nodes
at level k, some node will reach level N/k within
2cN/k time. Thus, we have the following lemma:

Lemma 4.2 After a node reaches level k, F terminates
within O(N/k) time.

In the following, we will design a protocol, G, in
which we will ensure that in every interval of c
time units, either at least k nodes wake up or
some node reaches level at least k. Then from
Lemma 4.1 and Lemma 4.2, the protocol will require
O(N/k) time. For this purpose, we require
a base node to execute two initial phases on waking
up spontaneously. If it successfully executes
these phases, it qualifies as a candidate for election
and proceeds by executing F. Intuitively,
the time complexity of the leader election protocol
depends on the ability to recognize the order
in which nodes wake up to participate in the protocol
so that nodes that wake up later are prohibited
from becoming candidates. In the first
phase, a node tries to obtain permission from k
other nodes. If i requests permission from j after
j has finished executing its first phase, j denies
permission to i. In this case, i gets ordered after
j and is not allowed to participate as a candidate.
However, as we will show later, this allows
a node which wakes up O(N/k) time units after
the first node wakes up to obtain permission
from k other nodes and participate as a candidate.
The first phase is as follows:

On waking up spontaneously, node i selects k incident
edges and sends a first-phase(i) message
on each of these edges. On receiving this message
over an edge e, site j behaves as follows:

• If j is not a captured node then

  ★ If j has finished executing its first phase then
    it sends a finish message over e.

  ★ If j is passive then i becomes j's owner
    and j marks e as owner-link_j. It sends an
    accept message over e and changes its state
    to captured.

  ★ If j is in the first phase then j sends a
    proceed message over e.

• If j is captured then it checks whether its
  owner has finished the first phase. For
  this purpose, it sends a check message over
  owner-link_j (if it has already sent a check
  message, it waits for the reply to avoid congestion).
  If the owner replies that it has finished
  the first phase, then j sends a finish
  message to i and to any other node from
  which it has received (or will receive in the
  future) a first-phase message in the meantime.
  If the owner has not finished the first
  phase then j sends a proceed message to i
  and also to any other node from which it
  may have received a message in the meantime.
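
The reply rules listed above can be summarized as a small decision
function. This is a sketch only: the state names and the function
first_phase_reply are hypothetical labels, and the check/reply
exchange with the owner is abstracted into a single boolean.

```python
def first_phase_reply(state: str, owner_finished: bool = False) -> str:
    """Reply of node j to a first-phase(i) message, following the rules above.

    Assumptions (illustrative): state is one of 'passive', 'first_phase',
    'finished' or 'captured'; owner_finished stands for the owner's answer
    to the check message and is only consulted when j is captured.
    """
    if state == "captured":
        return "finish" if owner_finished else "proceed"
    if state == "finished":
        return "finish"
    if state == "passive":
        return "accept"  # j also becomes captured, with i as its owner
    if state == "first_phase":
        return "proceed"
    raise ValueError(f"unknown state: {state}")
```

A single finish reply is enough to keep the requesting node out of
the second phase, which is what orders late risers after earlier
nodes.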

After node i has received responses to all k
first-phase messages, it behaves as follows: It
exits the first phase. If it has received a finish
message then it does not enter the second phase
and changes status to killed. Otherwise, it enters
the second phase. It also updates its level
number to the number of accept messages received
in the first phase. In the second phase,
node i tries to reach level k. For this purpose, it
sends a capture(level_i, i) message on each edge
on which it received a proceed message. The
rules for capturing are the same as in protocol
E with the following changes: Nodes which
have not started the second phase are regarded
as passive by these capture messages. A node
increases its level number only after receiving
an accept response to each capture message sent
in the second phase. After finishing the second
phase, a node executes F.

In G, a base node finishes its first phase within
5 time units of waking up [Si92]. Furthermore,
there will be a node which enters the second
phase and a node which finishes the second phase
to participate in F. Hence, the protocol will elect
a leader. Using the technique in [BKWZ87], we
also extend our protocol to obtain a protocol resilient
to f initial site failures, where f < N/2,
which requires O(Nf + N log N) messages and
O(N/log N) time.


Lemma 4.3 The time complexity of G is
O(N/k).

Proof: A base node will finish its first phase
within 5 time units of waking up. If node i wakes
up spontaneously at time t then for i to participate
in the second phase, each of its first-phase
messages must go to a node which is in its first
phase (this node must have awakened spontaneously
after time t - 5, otherwise it will have
finished its first phase by time t) or is passive
(this node will wake up by time t + 1 as a result
of i's message). Therefore, i is able to proceed
to the second phase only if at least k nodes other
than i wake up in the interval [t - 5, t + 1].

Consider an interval of 5 time units, say
[m, m + 5] where m ≥ 0, during the execution
of the protocol. We have the following cases:

(1) At least k + 1 nodes wake up in the interval
[m - 5, m + 6].

(2) Fewer than k + 1 nodes wake up in the interval
[m - 5, m + 6]. In this case, we will show that
some node will reach level at least k. All nodes
that wake up before time m will finish their first
phase by time m + 5. Any node which completes
its first phase in the interval [m + 5, m + 6] will not
be able to participate in F as a candidate (since
it must have awakened in the interval [m, m + 6]
and fewer than k + 1 nodes wake up in the interval
[m - 5, m + 6]). If a node has not already reached
level k then let i be the node with the highest
identity among the nodes which are in the second
phase at time m + 5. Let a capture message
from i reach a node j which is not captured. We
have the following cases:

(a) If j has not started the second phase then it
will respond with an accept message.

(b) If j has started the second phase then it must
be the case that j entered the second phase at
or before time m + 5 and therefore j < i (by assumption).
Hence, j will respond with an accept
message.

If j is a captured node then it will forward
the message to its owner. If owner_j
has not started the second phase then it will send
an accept message. Otherwise, we will show that
owner_j must have entered the second phase before
time m + 5. Assume not. Since nodes that
wake up before time m enter the second phase
before time m + 5 and nodes that wake up in the interval
[m, m + 5] do not participate in the second
phase, owner_j must have awakened after time
m + 5. In this case, the first-phase message from
owner_j will reach j after time m + 5. However,
by this time, i would already have sent its first-phase
message to j and therefore, owner_j cannot
capture j. Hence, owner_j must have finished the
second phase before time m + 5 in which case it
will send an accept message to i since owner_j < i
by assumption. Thus, i will receive all accept
responses and therefore i will finish its second
phase.

Hence, in each interval of 11 time units (m - 5
to m + 6), either at least k nodes wake up or
some node will reach level k. Therefore, by
time 11N/k, either all nodes will be awake or
some node will have reached level k. Then, from
Lemma 4.1 and Lemma 4.2, the protocol will terminate
in O(N/k) time units. □

In [Si92], we show that the time complexity
of G depends on the number of candidate
nodes. By using the capturing pattern of the
synchronous protocol in [AG85], we have obtained
a message optimal protocol which requires
O(log N + min(r, N/log N)) time, where r is the
number of candidate nodes.

5  A  Lower  Bound 

We will now prove that Ω(N/log N) is a lower
bound on the time complexity of any message
optimal protocol for leader election in complete
asynchronous networks. We will restrict ourselves
to comparison-based leader election algorithms.
We prove the following theorem:

Theorem 5.1 Any comparison-based protocol
for leader election in a complete asynchronous
network which sends fewer than Nd messages will
require at least N/16d time.

Corollary 5.1 Any message optimal protocol
for leader election in a complete asynchronous
network, i.e., requiring O(N log N) messages, will
require Ω(N/log N) time.

Proof of Theorem 5.1: For simplicity, assume
that nodes have identities belonging to the set
{1, ..., N} and let k = 2d. A one-to-one function
f from a set of processor identities to another set
of processor identities is order-preserving if i < j
implies f(i) < f(j). Two lists {x_1, ..., x_m} and
{y_1, ..., y_m} are order-equivalent if (x_i < x_j) ⟺
(y_i < y_j). Following [FL87], we assume that
each processor's local state consists of its identity,
its initial state and the history of messages
it has received so far. Further, a node sends its
entire local state in each message. Intuitively,
a process state includes all events that can potentially
affect this state [Lam78] and the happens-before
ordering information between these
events. Let event(i, t) denote the set of events
which can potentially affect the state of i at time
t in an execution. We say that event(i, t) and
event(j, t') are order-equivalent if there exists an
order-preserving function which maps event(i, t)
to event(j, t') such that the happens-before relation
is preserved. If event(i, t) and event(j, t')
are order-equivalent then we say that the state
of i at time t and the state of j at time t' are order-equivalent.
A comparison-based protocol cannot
distinguish between order-equivalent states. Our
proof involves showing that we can keep processes
in order-equivalent states for a long period
of time.

Let A be a protocol for leader election which
requires Nd messages. Let Ex be the set of executions
of A in which (1) all nodes wake up spontaneously
at the same time, (2) all messages take
ε time to reach their destination, where ε < 1/2
and (3) if a set of messages arrive at the same
time at a site then the messages are accepted
in the increasing order of sender identities. Let
Up_i denote the ordered list of edges from i to
nodes i + 1, ..., i + k, arranged in the increasing
order of site identities and Down_i denote
the set of edges from i to nodes with identities
i - 1, ..., i - k. Since a node cannot distinguish between
untraversed incident links, the adversary
has the freedom to choose any untraversed edge
whenever the node wants to send a message over
an untraversed edge. In particular, the adversary
acts as follows: Whenever a node i has to send
a message over a new edge, the adversary selects
the edges first from Up_i. If all edges in Up_i have
been used then it selects other unused edges. The
actions of the adversary try to impose a symmetry
on the nodes. We partition the nodes into
the following sets: {S_1, ..., S_{N/k}}, where S_i =
{k(i - 1) + 1, ..., ki}. Let R = S_2 ∪ ... ∪ S_{m-1},
where m = N/k, and R' = S_1 ∪ S_m. As long
as nodes in R remain in order-equivalent states,
each node i will only communicate with nodes
in Up_i ∪ Down_i; otherwise each node in R will
send messages over at least k + 1 edges and the
number of messages will exceed Nd. This fact
and conditions (1)-(3) impose a symmetry on
nodes in R. Intuitively, nodes in R cannot break
symmetry without communicating with nodes in
R'. Each node i in R executes the same protocol
and communicates with at most k nodes
with larger identities (belonging to Up_i) and at
most k nodes with smaller identities (belonging
to Down_i). However, nodes in R' are asymmetric
with respect to these nodes. In any execution,
let nodes(i, t) denote the set of sites at which
events in event(i, t) occur. Then, for example,
at time ε, sites 1 and i, where i ∈ R, may not be
in order-equivalent states since 1 may receive a
message from a node with a higher identity while
i only receives messages from nodes with lower
identities. However, at this time, event(i, ε) and
event(j, ε), where i, j ∈ R, are order-equivalent.
Hence, any node i in R will only send messages
to nodes in Up_i ∪ Down_i at time ε. Observe that
nodes(i, ε) ⊆ S_{y-1} ∪ S_y ∪ S_{y+1}, where i ∈ S_y.
If 1 sends a message at time ε to site i ∈ S_2
then it can force i and another site j ∈ R to
be in order-inequivalent states at time 2ε. However,
for each node i ∈ S_3 ∪ ... ∪ S_{m-2}, if
nodes(i, 2ε) ⊆ S_{y-2} ∪ ... ∪ S_{y+2} then nodes in
S_3 ∪ ... ∪ S_{m-2} will be in order-equivalent states at
time 2ε (in this case, no node would have received
a message at time 2ε from a node in R' which,
at time ε, was in a state order-inequivalent with
respect to states of nodes in R). In general, we
prove the following: Let M_x = S_{x+1} ∪ ... ∪ S_{m-x}
and depth(i, t) denote the longest chain of messages
involving events in event(i, t).

Lemma 5.1 There exists an execution in Ex
such that at any time t and i ∈ S_y ∩ M_x, where
x ≤ m/4, if depth(i, t) ≤ x and nodes(i, t) ⊆
S_{y-x} ∪ ... ∪ S_{y+x}, then for all j ∈ M_x, event(i, t)
and event(j, t) are order-equivalent.

Proof Outline: We prove this by induction on x.
The result is immediate for x = 0 since processes
are initially in order-equivalent states. Assume
that the hypothesis holds for x ≤ l. Then there
exists an execution in Ex such that at any time t,
if depth(i, t) ≤ l and nodes(i, t) ⊆ S_{y-l} ∪ ... ∪ S_{y+l}
then for all j ∈ M_l, event(i, t) and event(j, t)
are order-equivalent. Let t be the maximum
time in this execution at which depth(i, t) ≤ l
for i ∈ M_l (i.e., any message sent after this
time increases the depth). Assume that at some
time t' in this execution, depth(i, t') = l + 1 and
nodes(i, t') ⊆ S_{y-l-1} ∪ ... ∪ S_{y+l+1}, where i ∈
M_{l+1}. Let j ∈ M_{l+1}. Since i, j ∈ M_l, event(i, t)
and event(j, t) are order-equivalent. Let p send
a message to i at time t which is in event(i, t')
but not in event(i, t). Then depth(p, t) = l and
p ∈ M_l (otherwise, we can show a contradiction
since i ∈ Up_p ∪ Down_p). Let q = j - (i - p).
Then q ∈ M_l. From the induction hypothesis,
event(p, t) and event(q, t) are order-equivalent.
Hence, q will also send a message to j. Therefore,
event(i, t') and event(j, t') will be order-equivalent. □

Lemma 5.2 There exists an execution in Ex
such that at any time t and i ∈ M_{m/4} ∩ S_y,
if nodes(i, t) ⊆ S_{y-m/4} ∪ ... ∪ S_{y+m/4} and
depth(i, t) ≤ m/4 then i must have communicated
only with nodes in Up_i and Down_i.

Proof. Assume not. Assume that i sends a
message to a node not in Up_i ∪ Down_i. Let
j ∈ M_{m/4}. Then by Lemma 5.1, there exists
an execution in which event(i, t) and event(j, t)
are order-equivalent. Since i sends a message to
a node not in Up_i ∪ Down_i, j will also send a
message to a node not in Up_j ∪ Down_j (since
order-equivalent states are indistinguishable to
a comparison-based protocol). Thus, each node
in M_{m/4} will send at least k + 1 messages. Since
|M_{m/4}| = N/2, the execution will involve at least
N(k + 1)/2 messages which is a contradiction. □

We will make use of symmetry between nodes
in R to construct an execution in which nodes in
R' wake up much later in the protocol to break
symmetry. Let ex be an execution of A. Let
the nodes be partitioned into three sets, P_1, P_2
and P_3, such that P_1 = {1, ..., p_1}, P_2 = {p_1 +
1, ..., p_2} and P_3 = {p_2 + 1, ..., N}. Assume
that (1) nodes in P_1 and P_3 wake up at time t,
(2) no messages have been sent by nodes in P_2
to nodes in P_1 and P_3 up to time t and (3) all
messages sent at or after time t take ε time units.
Let g(ex, P_2) denote the execution in which all
nodes in P_2 wake up 1 - ε time earlier than in
ex but all links incident on nodes in P_2 remain
idle in the interval [t - (1 - ε), t], i.e., transmission
of messages does not make progress during this
period. Due to the asynchronous nature of the
network, no node can distinguish between ex and
g(ex, P_2). This technique of increasing message
transmission time without violating the happens-before
ordering is similar to the one in [AFL83].

Let M be the set of messages in ex sent at
time t from i to j, where i, j ∈ P_2 and j receives
no messages from nodes in P_1 ∪ P_3 at time t + ε.
Let h(ex, P_2) denote the execution in which links
incident on nodes in P_2 are not idle in the period
[t - (1 - ε), t] but all messages sent in this interval
except those in M take ε + (1 - ε) time. The
effect of this transformation is to increase the
delays on the links from ε to ε + 1 - ε, which
cannot be distinguished from links remaining idle
for 1 - ε time units. Hence, ex and h(ex, P_2) are
indistinguishable to all nodes.

We will construct a set of executions in a series
of steps which takes O(N/k) time. Let ex be
an execution in Ex which satisfies Lemma 5.2.
Let A_1 = S_1 ∪ ... ∪ S_{m/2-1}, B_1 = S_{m/2} and
C_1 = S_{m/2+1} ∪ ... ∪ S_m, where m = N/k. All
nodes wake up at time t = 0 and no messages
are sent before that time. Furthermore, all messages
take ε time units. Hence, ex and h(ex, B_1)
are indistinguishable to all nodes. Let ex_1 =
h(ex, B_1). Let A_2 = S_1 ∪ ... ∪ S_{m/2-2}, B_2 =
S_{m/2-1} ∪ S_{m/2} ∪ S_{m/2+1} and C_2 = S_{m/2+2} ∪ ... ∪ S_m.
In ex_1, no messages are sent to nodes in A_2 and
C_2 before time 1 - ε (since nodes in S_{m/2} communicate
only with nodes in S_{m/2-1} and S_{m/2+1}
until nodes not in R' wake up (from Lemma 5.2)).
Further, all messages sent at or after time 1 - ε
take ε time units. Therefore, ex_1 and h(ex_1, B_2)
are indistinguishable to all nodes. Since ex_1
and ex are indistinguishable, ex and h(ex_1, B_2)
are also indistinguishable to all nodes. Let ex_2
= h(ex_1, B_2). Continuing in this way, we can
construct an execution ex_{m/4}, where A_{m/4} =
S_1 ∪ ... ∪ S_{m/4}, B_{m/4} = S_{m/4+1} ∪ ... ∪ S_{3m/4} and
C_{m/4} = S_{3m/4+1} ∪ ... ∪ S_m, which is indistinguishable
from ex. The execution time of ex_{m/4} is at
least (m/4)(1 - ε). Since (m/4)(1 - ε) = (N/4k)(1 - ε)
= (N/8d)(1 - ε) > (N/8d)(1/2) = N/16d, we have
an execution which requires N/16d time. □

6  Conclusion 

In  this  paper,  we  have  presented  distributed  algo¬ 
rithms  for  leader  election  in  complete  networks. 
For  asynchronous  networks  with  a  sense  of  direc¬ 
tion,  we  first  presented  a  simple  protocol  which 
requires  0{N)  messages  and  O(V^)  time.  We 
then improved the time complexity of this protocol
to $O(\log N)$ time. An interesting question
is  whether  synchronized  clocks  can  be  used  to 
improve  the  time  complexity  of  this  protocol. 

For complete asynchronous networks without
sense of direction, we also showed a lower bound
of $\Omega(N/\log N)$ on the time complexity of any
message optimal leader election protocol. We
presented a protocol which requires $O(N/\log N)$
time and $O(N \log N)$ messages. In [AG85], a
lower bound of $\Omega(\log N)$ on the time complexity
of any message optimal protocol for synchronous
complete networks was shown. This proves that
introducing asynchrony may result in a loss in
speed by a factor of $N/(\log N)^2$. In [Si92], we
study the problem of leader election in partially
synchronous networks and present lower bounds
for such networks.

Abstract

The  session  problem  is  an  abstraction  of  synchroniza¬ 
tion  problems  in  distributed  systems.  It  has  been  used 
as  a  test-case  to  demonstrate  the  differences  in  the  time 
needed  to  solve  problems  in  various  timing  models,  for 
both shared memory (SM) systems [2] and message-passing
(MP) systems [4]. In this paper, the session
problem  continues  to  be  used  to  compare  timing  mod¬ 
els  quantitatively.  The  session  problem  is  studied  in  two 
new  timing  models,  the  periodic  and  the  sporadic.  Both 
SM  and  MP  systems  are  considered.  In  the  periodic 
model,  each  process  takes  steps  at  a  constant  unknown 
rate;  different  processes  can  have  different  rates.  In  the 
sporadic  model,  there  exists  a  lower  bound  but  no  up¬ 
per  bound  on  step  time,  and  message  delay  is  bounded. 
We  show  upper  and  lower  bounds  on  the  time  complex¬ 
ity  of  the  session  problem  for  these  models.  In  addition, 
upper  and  lower  bounds  on  running  time  are  presented 
for  the  semi-synchronous  SM  model,  closing  an  open 
problem  from  [4].  Our  results  suggest  a  hierarchy  of 
various  timing  models  in  terms  of  time  complexity  for 
the  session  problem. 

1  Introduction 

Early work in distributed computing usually assumed
one of two extreme timing models: either
the completely synchronous model, in which processes
operate in lockstep rounds of computation,
or the completely asynchronous, in which there are
no upper bounds on process step time or message
delay. Since both of these timing assumptions are
often unrealistic, researchers began to investigate
the impact on distributed computing if those timing
assumptions are relaxed or tightened to some
extent in order to reflect the real time situation.
This  question  has  been  studied  for  a  variety  of  prob¬ 
lems,  including  Byzantine  agreement  [7,  8,  1,  13], 
mutual  exclusion  [3],  leader  election  [5],  transaction 
commit  [6],  and  the  session  problem  [2,  4]. 

The  (s,  n)-session  problem,  first  presented  in  [2], 
is  an  abstraction  of  the  synchronization  needed 
to  solve  many  distributed  computing  problems. 
Therefore, it is an important tool for understanding
the behavior of distributed systems under different
timing constraints. Informally, a session is
a minimal-length computation fragment that involves
at least one “synchronization” step by every
process in a distinguished set of n processes.
An algorithm that solves the (s, n)-session problem
must  guarantee  that  in  every  computation  there  are 
at  least  s  disjoint  sessions  and  eventually  all  the  n 
processes  become  idle. 

We  study  the  problem  in  two  different  interpro¬ 
cess  communication  models:  shared  memory  and 
message passing. In the shared memory model, processes
communicate only by means of shared variables.
Each variable is shared by no more than b
processes, where b is a constant relative to the total
number of processes. In the message passing
model,  communication  is  done  by  exchanging  mes¬ 
sages  across  a  network.  A  process  can  broadcast  a 
message  at  a  step;  the  message  is  guaranteed  to  be 
delivered  to  every  process  after  some  finite  time. 

The relevant timing aspects of a model are the
lower bound on process step time, $c_1$, the upper
bound on process step time, $c_2$, and additionally,
for the message passing model, the lower bound on
message delay, $d_1$, and the upper bound on message
delay, $d_2$. The running time of an algorithm for the
(s, n)-session problem is the maximum time, over
all computations, until all the n processes become
idle. If there is an upper bound in real time on $c_2$
and $d_2$, then it makes sense to measure the running
time  in  terms  of  real  time.  If  not,  then  the  common 
way  to  measure  the  running  time  is  with  rounds.  A 
round  is  a  minimal  computation  fragment  in  which 
every  process  takes  at  least  one  step. 

Arjomandi, Fischer and Lynch [2] studied the
(s, n)-session problem in synchronous and asynchronous
shared memory models. Synchronous
means that $c_1 = c_2$, a finite number. Asynchronous
means that $c_1$ and $c_2$ are infinite. Their results
showed a significant time complexity gap between
the synchronous and asynchronous models, namely
that s rounds are sufficient for the synchronous
case but $(s-1)\lfloor \log_b n \rfloor$ rounds are necessary for
the asynchronous case, where n is the size of the
distinguished set of processes. The implication is
that no communication is needed at all in the synchronous
case, but it is needed for every session in
the asynchronous case. (The $\lfloor \log_b n \rfloor$ factor is
essentially the cost of communication when no more
than b processes can access any shared variable.)
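
The gap between the two models can be made concrete with a small calculation. The following is a minimal sketch and not part of the original paper: the function names are hypothetical, and the asynchronous figure is simply the lower bound $(s-1)\lfloor \log_b n \rfloor$ attributed to [2] above.

```python
def floor_log(base: int, n: int) -> int:
    """Largest integer k with base**k <= n, using integer arithmetic only."""
    if base < 2 or n < 1:
        raise ValueError("need base >= 2 and n >= 1")
    k, power = 0, 1
    while power * base <= n:
        power *= base
        k += 1
    return k


def sync_rounds_sufficient(s: int) -> int:
    """Rounds that suffice in the synchronous model: s."""
    return s


def async_rounds_necessary(s: int, n: int, b: int) -> int:
    """Lower bound (s - 1) * floor(log_b n) for the asynchronous model."""
    return (s - 1) * floor_log(b, n)


if __name__ == "__main__":
    s, n, b = 5, 64, 2
    print(sync_rounds_sufficient(s), async_rounds_necessary(s, n, b))
```

With b fixed, the asynchronous bound grows with both s and log n, while the synchronous cost depends on s alone.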

Attiya  and  Mavronicolas  [4]  addressed  the  prob¬ 
lem  in  semi-synchronous  and  asynchronous  message 
passing systems. Semi-synchronous means that
$c_1 > 0$, $c_2$ and $d_2$ are finite, and these constants are
known to the processes. They modeled the asynchronous
system differently from [2]: they let $c_1 = 0$
and $d_1 = 0$, while $c_2$ and $d_2$ are finite. Their results
also indicated an important time separation
between  semi-synchronous  and  asynchronous  net¬ 
works,  again  based  on  whether  or  not  communica¬ 
tion  is  necessary. 

We present almost matching upper and lower
bounds for the session problem in the semi-synchronous
shared memory model. Our bounds
are similar to those in [4] for the message passing
model when the cost for information propagation
in the shared memory model is substituted
for the message delay. They indicate that if the
time for one communication is less than that for
one step multiplied by the ratio of $c_2$ and $c_1$, the
model behaves like the asynchronous model; otherwise
it behaves like the synchronous model (inflated by the
ratio). Mavronicolas [12] has also independently
developed the same lower and upper bounds for the
shared memory semi-synchronous model.

We introduce two new timing models for the
(s, n)-session problem: the periodic and the sporadic.
In the periodic model, for each process there
exists  an  unknown  constant  such  that  the  process 
makes  one  step  at  every  period  of  the  constant.  In 
the  message-passing  variant,  d2  is  finite  and  known. 
The  upper  bounds  for  both  the  shared  memory  and 
message  passing  models  are  the  time  for  the  slow¬ 
est  process  to  take  s  steps  plus  the  time  for  one 
communication.  The  lower  bounds  for  both  are  the 
maximum  of  the  time  for  the  slowest  process  to 
take  s  steps  and  approximately  the  time  for  one 
communication.  Our  results  indicate  that  the  pe¬ 
riodic  model,  which  requires  one  communication, 
falls  in  between  the  synchronous  and  asynchronous 
models,  which  require  no  and  s  —  1  communications 
respectively. 

In the sporadic model, there exists a nonzero
lower bound $c_1$, but no upper bound, on the time
between any two consecutive steps of any process.
The sporadic shared memory model is essentially
equal to the asynchronous shared memory model
and is not considered. For the message passing
model, the message delay is within $[d_1, d_2]$, where
$d_1 > 0$, $d_2$ is finite, and both are known. The
combination  of  the  lower  bound  on  step  time  and 
upper  bound  on  message  delay  allows  processes  to 
make  inferences  about  the  computation,  namely, 
that  enough  time  has  elapsed  so  that  a  message 
must  have  arrived.  The  lower  bound  on  the  per- 
session  time  is  max{[^J  /f,  ci),  where  u  =  d2—di 
and  K  =  •  The  upper  bound  on  the  per- 

session  time  is  min{( -b 3)  •  7  -t- o,  ^2  +  t}i  where 
$\gamma$ is the largest step time by a process before
termination. As the message delay approaches a constant
(i.e., $d_1$ approaches $d_2$), the per-session time
becomes $\max\{0, c_1\} = c_1$ for the lower bound and
$\min\{3\gamma, d_2 + \gamma\} = O(\gamma)$ for the upper bound, which
is like the synchronous model.

As the message delay fluctuates within a bigger
interval (i.e., $d_1$ approaches 0), the per-session time
becomes $\max\{d_2, c_1\} = d_2$ for the lower bound and
+  3)  •  7  +  rfz.rfz  +  7}  =  0(d2  -b  7)  for 
the  upper  bound,  which  is  like  the  asynchronous 
model. 

These two timing constraints are inspired by constraints
with the same names commonly used in
many real-time problems, especially in scheduling
of  real  time  tasks  for  a  uniprocessor  [9,  10,  11] 
where  the  period  of  task  occurrences  conforms  to 
the  constraints.  In  practice,  as  quoted  in  [10],  pe¬ 
riodic  timing  constraints  are  used  in  applications 
such  as  avionics  and  process  control  when  accurate 
control  requires  continual  sampling  and  processing 
of  data.  The  sporadic  timing  constraint  is  associ¬ 
ated  with  event-driven  processing  such  as  respond¬ 
ing  to  user  inputs  or  non-periodic  device  interrupts; 
these events occur repeatedly, but the time interval
between consecutive occurrences varies and can
be arbitrarily large. Therefore, the sporadic timing
constraint  models  processes  that  can  be  blocked  for 
an  arbitrarily  long  (but  finite)  time  waiting  for  a 
certain  condition  to  be  true  or  a  certain  event  to 
occur,  but  that  cannot  make  two  consecutive  steps 
faster  than  a  certain  lower  bound. 

Table 1 summarizes the bounds. L means lower
bound, U means upper. In the periodic model, $c_{\min}$
and $c_{\max}$ are the smallest and the largest step times,
respectively, of all processes. The bounds for the
asynchronous  shared  memory  case  are  in  rounds. 
The  bounds  from  [4]  have  been  converted  in  three 
aspects  for  purposes  of  comparison:  (1)  That  paper 
considers  point-to-point  networks;  thus  the  results 
include  a  factor  of  the  network  diameter.  In  our 
model,  d2  subsumes  the  diameter  factor;  we  have 
replaced  all  occurrences  of  the  diameter  factor  with 
1.  (2)  In  [4],  the  constant  1  is  used  as  the  value  of 
C2;  we  have  replaced  all  appropriate  occurrences  of 
1  with  C2.  (3)  [4]  assumes  that  all  processes  take 
their  synchronized  first  steps  at  time  0,  resulting 
in  one  session  at  time  0;  although  we  assume  that 
all  processes  start  at  time  0,  we  don’t  assume  that 
all  take  a  synchronized  step  at  time  0.  We  rather 
assume  that  all  steps  (including  the  first  step)  obey 
the timing constraints of a specific model starting
at time 0.

Our results indicate that the periodic model is
more efficient than the semi-synchronous system
when $c_{\max} = c_2$, $2c_1 < c_2$ and n is constant relative
to s. The lower bound for the sporadic system
and the upper bound for the periodic system
suggest that the periodic system is more efficient
than the sporadic system if $c_{\max}$ is smaller than
[^J  ■  K  <  2(d2-'u/2)  constant  relative  to 

s. In shared memory, the sporadic system is clearly
less efficient than the semi-synchronous one, but
the relationship between the sporadic and the semi-synchronous
systems for message passing is rather
unclear, and understanding it requires further study.

The  rest  of  the  paper  is  organized  as  follows. 
Section  2  contains  the  definition  of  system  models 
and Section 3 describes how to accomplish communication
in the shared memory model. Section 4
concerns the periodic model, Section 5 the
shared memory semi-synchronous model, and Section 6
the message-passing sporadic model. Please note that
our  lower  bound  proof  technique  combines  those 
in  [2,  4].  Some  proofs  are  omitted  or  only  sketched 
due  to  space  constraints. 

2  Definitions 

2.1  Systems 

The  generalized  system  model  definition  for  shared 
memory  and  message  passing  models  is  similar  to 
that  defined  in  [2]. 

There  are  finite  sets  P  of  processes  and  X  of 
shared variables. A process consists of a set of internal
states, including an initial state. Each shared
variable has a set of values that it can take on,
including an initial value. A global state is a tuple
of internal states, one for each process, and
values, one for each shared variable. The initial
global state contains the initial state for each process
and the initial value for each shared variable.
A step $\pi$ consists of simultaneous changes to the
state of some process and the values of some number
of variables, depending on the current state of
that process and current values of the variables.
More formally, we represent the step $\pi$ with a tuple
$((s, p, r), (u_1, x_1, v_1), \ldots, (u_k, x_k, v_k))$, where s and r
are the old and new states of a process $p \in P$, and $u_i$ and $v_i$
are the old and new values of a shared variable $x_i \in X$
for all i. We say that step $\pi$ is applicable to a global
state if p is in state s and $x_i$ has value $u_i$ for all i
in the global state.
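
The step tuple and the applicability condition can be sketched in a few lines. This is an illustrative sketch only: the class and function names (`Step`, `GlobalState`, `applicable`, `apply_step`) are hypothetical, and states and values are modeled as plain Python objects, which the paper does not prescribe.

```python
from typing import Dict, NamedTuple, Tuple


class Step(NamedTuple):
    process: str      # p
    old_state: str    # s
    new_state: str    # r
    # One (variable x_i, old value u_i, new value v_i) triple per accessed variable.
    accesses: Tuple[Tuple[str, object, object], ...]


class GlobalState(NamedTuple):
    states: Dict[str, str]      # internal state of each process
    values: Dict[str, object]   # value of each shared variable


def applicable(step: Step, g: GlobalState) -> bool:
    """p is in state s and every x_i currently holds u_i."""
    if g.states[step.process] != step.old_state:
        return False
    return all(g.values[x] == u for x, u, _ in step.accesses)


def apply_step(step: Step, g: GlobalState) -> GlobalState:
    """Return the global state that results from taking an applicable step."""
    if not applicable(step, g):
        raise ValueError("step is not applicable to this global state")
    states, values = dict(g.states), dict(g.values)
    states[step.process] = step.new_state
    for x, _, v in step.accesses:
        values[x] = v
    return GlobalState(states, values)


if __name__ == "__main__":
    g0 = GlobalState({"p": "s0", "q": "s0"}, {"x": 0})
    step = Step("p", "s0", "s1", (("x", 0, 1),))
    print(apply_step(step, g0))
```

A step that was applicable to one global state is in general no longer applicable to the state it produces, because the process has left state s.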

A system is specified by describing P, X, and the set
E of possible steps. For all processes $p \in P$ and all
global states g, there must exist some step involving
process p that is applicable to global state g.
A computation of a system is a sequence of steps
$\pi_1, \pi_2, \ldots$ such that: (1) $\pi_1$ is applicable to the initial
global state, (2) each subsequent step is applicable
to the global state resulting from the previous
steps, and (3) if the sequence is infinite, then every
process takes an infinite number of steps. That is,
there is no process failure. A timed computation of
a system is a computation $\pi_1, \pi_2, \ldots$ together with
a mapping T from positive integers to real numbers
that associates a real time with each step in
the computation. T must be nondecreasing and, if
the computation is infinite, increase without bound.
We will abuse notation and let $T(\pi_i)$ indicate the
time at which step $\pi_i$ occurs.

Table 1: Bounds for the Session Problem

2.1.1  Shared  Memory  Model  (SMM) 

We  specialize  the  general  system  into  the  shared 
memory  system  in  which  processes  communicate 
with each other by means of shared variables. Each
step $\pi$ has the property that k = 1. That is, it involves
only one shared variable. A process can read
and write a shared variable in a single atomic step,
and we don’t assume any upper bound on the size
of the variables. We let b be the maximum number
of processes that access any single variable, in all
the steps of the system. We assume b is constant
relative to the number of processes.

2.1.2  Message  Passing  Model  (MPM) 

We specialize the general system into the message
passing system, in which processes communicate
with each other by exchanging messages.
$P = R \cup \{N\}$, where R is the set of regular processes
and N is the network. The network schedules the
delivery of messages sent among the regular processes.
$X = \{net\} \cup \{buf_p : p \in R\}$, where the
values taken on by each variable are sets of messages.
net models the state of the network, i.e.,
the set of messages in transit. $buf_p$ holds the set
of messages that have been delivered to p by the
network but not yet received by p.


A step of a process p in R consists of p receiving
the set M of messages in its buffer $buf_p$ and,
based solely on those messages and its current state,
changing its local state and sending out some message
m to all the regular processes. The result is
to set $buf_p$ to empty and to add (m, q) to net, for
all q in R. So, the step involves two shared variables,
$buf_p$ and net. A step of N is to deliver some
message of the form (m, q) in net to q. The result
is to remove (m, q) from net and add m to $buf_q$.
Accordingly, the step also involves two shared variables,
net and $buf_q$.
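
The two kinds of MPM steps can be sketched as operations on net and the buffers. This is a minimal sketch under stated assumptions: the class name `MessagePassingState` and its method names are hypothetical, and net and each buffer are modeled as Python sets, so identical messages sent twice to the same recipient collapse into one entry.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


class MessagePassingState:
    """net and the buffers buf_p of the MPM, modeled as sets (an assumption)."""

    def __init__(self, regular: Iterable[str]) -> None:
        self.regular = list(regular)                      # the set R
        self.net: Set[Tuple[Hashable, str]] = set()       # pairs (m, q) in transit
        self.buf: Dict[str, Set[Hashable]] = {p: set() for p in self.regular}

    def process_step(self, p: str, message: Hashable) -> Set[Hashable]:
        """Step of regular process p: empty buf_p, then add (m, q) to net for all q in R."""
        received = self.buf[p]
        self.buf[p] = set()
        for q in self.regular:
            self.net.add((message, q))
        return received

    def network_step(self, message: Hashable, q: str) -> None:
        """Step of the network N: move (m, q) from net into buf_q."""
        self.net.remove((message, q))   # raises KeyError if (m, q) is not in transit
        self.buf[q].add(message)


if __name__ == "__main__":
    state = MessagePassingState(["p", "q"])
    state.process_step("p", "hello")
    state.network_step("hello", "q")
    print(state.process_step("q", "ack"))   # q receives what was delivered to it
```

Each method touches exactly two variables, matching the observation that both kinds of step involve a buffer and net.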

This  definition  of  the  MPM  is  an  abstract  model 
of  a  reliable  strongly  connected  network  with  any 
topology. 

In  a  timed  computation,  each  message  has  a  de¬ 
lay,  defined  to  be  the  difference  between  the  time 
of  the  step  that  adds  it  to  net  and  the  time  of  the 
step  that  removes  it  from  net.  If  the  message  is 
never  removed,  then  it  has  infinite  delay.  The  delay 
only  counts  the  time  in  transit  in  the  network  and 
does  not  include  the  time  that  the  recipient  takes 
to  receive  the  message.  That  is,  the  time  elapsed 
between  the  delivery  step  and  the  step  which  fi¬ 
nally  removes  the  message  from  the  buffer  is  not 
counted  toward  the  message  delay,  even  if  the  mes¬ 
sage  remains  in  the  buffer  for  a  long  time  before  the 
recipient  picks  it  up  from  its  buffer. 

2.2  The  Real  Time  Constraints 

For each timing model considered, we define the
set of admissible timed computations to consist of
timed computations which obey the stated condition
on the step times of all processes in the SMM
(all regular processes in the MPM) and, additionally
for the MPM, the stated condition on the message
delay.

Synchronous: There exist constants $c_2$ and $d_2$ such
that in every timed computation, for every p in P
(p in R for MPM), the time between every pair of
consecutive steps of p is $c_2$, and the delay of every
message is $d_2$. Thus $c_2$ and $d_2$ are “known” to the
processes and can be used in algorithms.

Asynchronous: In every timed computation, every
process takes an infinite number of steps and every
message is eventually delivered.

Periodic: There exists a constant $d_2$ such that in
every timed computation, for every $p_i$ in P ($p_i$ in
R for MPM), there exists a constant $c_i$ such that the
time between every pair of consecutive steps of $p_i$
is $c_i$, and the delay of every message is in $[0, d_2]$.
Thus the $c_i$’s are unknown but $d_2$ is known.

Semi-Synchronous: There exist constants $c_1 > 0$,
$c_2$ and $d_2$ such that in every timed computation, for
every p in P (p in R for MPM), the time between
every pair of consecutive steps of p is in $[c_1, c_2]$ and
the delay of every message is in $[0, d_2]$. Thus $c_1$, $c_2$
and $d_2$ are known.

Sporadic: There exist constants $c_1$, $d_1$, and $d_2$ such
that in every timed computation, for every p in P
(p in R for MPM), the time between every pair of
consecutive steps of p is at least $c_1$, and the delay
of every message is in $[d_1, d_2]$. Thus $c_1$, $d_1$, and $d_2$
are known.
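
The step-time conditions of the models can be checked mechanically on a finite prefix of a timed computation. The sketch below is illustrative only: the function names are hypothetical, message delays are not checked, the gap before a process's first step is measured from time 0, and exact (for example integer) time values are assumed so that equality tests are meaningful.

```python
from typing import Dict, List, Sequence


def gaps(times: Sequence[float]) -> List[float]:
    """Times between consecutive steps of one process, counting from time 0."""
    points = [0, *times]
    return [later - earlier for earlier, later in zip(points, points[1:])]


def is_synchronous(step_times: Dict[str, Sequence[float]], c2: float) -> bool:
    """Every gap of every process equals c2."""
    return all(g == c2 for t in step_times.values() for g in gaps(t))


def is_periodic(step_times: Dict[str, Sequence[float]]) -> bool:
    """Each process has its own constant gap c_i."""
    return all(len(set(gaps(t))) <= 1 for t in step_times.values())


def is_semi_synchronous(step_times: Dict[str, Sequence[float]],
                        c1: float, c2: float) -> bool:
    """Every gap lies in [c1, c2]."""
    return all(c1 <= g <= c2 for t in step_times.values() for g in gaps(t))


def is_sporadic(step_times: Dict[str, Sequence[float]], c1: float) -> bool:
    """Every gap is at least c1; there is no upper bound."""
    return all(g >= c1 for t in step_times.values() for g in gaps(t))


if __name__ == "__main__":
    times = {"p": [2, 4, 6], "q": [3, 6, 9]}
    print(is_periodic(times), is_synchronous(times, 2), is_sporadic(times, 2))
```

The same prefix can be admissible in several models at once; here the example satisfies the periodic, semi-synchronous (with $c_1 = 2$, $c_2 = 3$) and sporadic conditions but not the synchronous one.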

2.3  The  Session  Problem 

We now state the conditions that must be satisfied
for a system to solve the (s, n)-session problem (also
called an (s, n)-session algorithm).

(1)  Each  process  in  P  (in  R  for  the  MPM)  has 
a  subset  of  idle  states.  The  set  E  of  steps  of  the 
system  guarantees  that  once  a  process  is  in  an  idle 
state,  it  always  remains  in  an  idle  state. 

(2) There is a distinguished set Y of n variables
called ports; Y is a subset of X in the SMM and the
set of buf’s in the MPM. There is a unique process
in P (in R for the MPM) corresponding to each
port, which is called a port process.

(3) Let p be a port process which corresponds
to a port y. A port step is any step involving p
and y. A session is any minimal sequence of steps
containing at least one port step for each port in
Y. In every admissible timed computation, there
are at least s disjoint sessions and eventually all
port processes are in idle states.
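
Counting disjoint sessions in a finite computation can be done greedily, closing a session as soon as every port has had a port step. The following is a sketch under assumptions that are not in the original: steps are given as (process, variable) pairs as in the SMM, where each step involves one variable, and the names `count_sessions` and `port_of` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def count_sessions(steps: Iterable[Tuple[str, str]], port_of: Dict[str, str]) -> int:
    """Greedy count of disjoint sessions.

    steps:   (process, variable) pairs in computation order.
    port_of: maps each port process to its own port.
    A port step is a step in which a port process accesses its own port.
    """
    all_ports: Set[str] = set(port_of.values())
    if not all_ports:
        raise ValueError("there must be at least one port")
    seen: Set[str] = set()
    sessions = 0
    for process, variable in steps:
        if port_of.get(process) == variable:
            seen.add(variable)
            if seen == all_ports:      # every port had a port step: close the session
                sessions += 1
                seen = set()
    return sessions


if __name__ == "__main__":
    ports = {"p": "y1", "q": "y2"}
    computation = [("p", "y1"), ("p", "y1"), ("q", "x"),
                   ("q", "y2"), ("q", "y2"), ("p", "y1")]
    print(count_sessions(computation, ports))
```

Closing each session at the earliest possible step leaves the longest possible remainder for later sessions, which is why the greedy scan yields the maximum number of disjoint sessions.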

In  the  timing  models  with  finite  upper  bounds  on 
step  time  and  message  delay,  we  measure  the  run¬ 
ning  time  of  an  algorithm  in  real  time  as  follows. 
An  algorithm  runs  in  time  t  if,  for  every  admissi¬ 
ble  timed  computation,  every  process  is  in  an  idle 
state  by  time  t.  In  the  case  of  the  asynchronous 
and  sporadic  models,  step  time  and/or  message  de¬ 
lay  is  unbounded  (but  finite).  For  these  cases,  we 
measure  the  running  time  in  rounds  [14,  2,  4].  A 
round  is  a  minimal-length  computation  fragment  in 
which  every  process  appears  at  least  once.  An  algo¬ 
rithm  runs  in  r  rounds  if,  in  every  admissible  timed 
computation  C,  the  prefix  of  C  before  all  processes 
are idle consists of at most r disjoint rounds. It
is also informative in these models to express the
time complexity of an algorithm in terms of a new
parameter $\gamma$, the largest step time during the computation
of the algorithm before all the processes
are idle. The value of $\gamma$ depends on the particular
computation of the algorithm. This type of
per-computation time complexity measure is
also used in [1].
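
Both measures are easy to compute for a finite prefix of a computation. The sketch below is illustrative and rests on assumptions of its own: the helper names are hypothetical, rounds are counted greedily as complete disjoint rounds, and the step time before a process's first step is measured from time 0.

```python
from typing import Dict, Iterable, Sequence, Set


def count_rounds(step_owners: Iterable[str], processes: Set[str]) -> int:
    """Number of complete disjoint rounds in a sequence of steps.

    step_owners lists, in order, the process that takes each step.
    A round ends as soon as every process has appeared at least once.
    """
    if not processes:
        raise ValueError("there must be at least one process")
    remaining = set(processes)
    rounds = 0
    for p in step_owners:
        remaining.discard(p)
        if not remaining:
            rounds += 1
            remaining = set(processes)
    return rounds


def largest_step_time(step_times: Dict[str, Sequence[float]]) -> float:
    """The parameter gamma: the largest time between consecutive steps of any process."""
    largest = 0
    for times in step_times.values():
        previous = 0
        for t in times:
            largest = max(largest, t - previous)
            previous = t
    return largest


if __name__ == "__main__":
    print(count_rounds(["p", "q", "q", "p", "p"], {"p", "q"}))
    print(largest_step_time({"p": [1, 2, 5], "q": [2, 4]}))
```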

3  Communication  in  SMM 

In  describing  our  algorithms,  we  use  a  subroutine 
called  broadcast  as  a  generic  operator  for  communi¬ 
cation  in  both  of  the  communication  models. 

In the MPM, the broadcasting of message m by
process p is taken care of by the network. It takes
at most $d_2 + c_2$ time for a message to be received
by all processes in the MPM.

However,  the  communication  in  the  SMM  is  con¬ 
strained  by  the  number  of  processes  which  can  ac¬ 
cess  a  shared  variable.  Therefore,  broadcasting  in 
the  SMM  involves  relaying  messages  from  process 
to  process  by  means  of  shared  variables. 

In a b-bounded shared memory system, we can
build a tree network of processes and shared variables
by making ports and port processes the leaves
of the tree. This network can accomplish the necessary
communication to propagate a piece of information
originating from a process to all other processes
in $O(\log_b n)$ steps.

In this paper, when we say broadcast in the SMM,
it implies all the interaction of processes in the tree
network to accomplish the broadcasting. Throughout
this paper, we only describe the role of port
processes in an algorithm and assume that broadcast
encapsulates the interactions among port processes
and other processes which participate in the
tree-network communication. In addition, we use
the term “step” interchangeably with “port step”;
when necessary, we make the proper distinction.

4  The  Periodic  Model 

The periodic model and the synchronous model
are similar in that a process takes steps at regular
time intervals, yet they differ from each other
in that there is no bound on the relative speed of
processes in the periodic model. We first present
an algorithm A(p) for the (s, n)-session problem
in this model and then show that for every periodic
algorithm A which solves the (s, n)-session problem,
there exists a computation of A which takes
at least $\max\{s \cdot c_{\max}, d_2\}$ time for the MPM and
$\max\{s, \lfloor \log_b n \rfloor\} \cdot c_{\max}$ for the SMM.

Algorithm A(p): (This algorithm runs in the
MPM and the SMM.) Each port process accesses
its own port s − 1 times and at its (s − 1)th step,
broadcasts the fact. It enters an idle state after it
hears that all other processes have taken s − 1 steps
and it has taken at least one more port step.
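
The timing of A(p) in the periodic MPM can be explored with a small simulation. This is a sketch under assumptions that are not in the original: process i takes its kth step at time k times its period, every message is delayed by exactly $d_2$ (the largest admissible delay), a message delivered at time t is received at the first step at or after t, and times are integers so that ceiling division is exact. The function name `idle_times` is hypothetical.

```python
from typing import List


def idle_times(periods: List[int], s: int, d2: int) -> List[int]:
    """Time at which each port process running A(p) enters an idle state."""
    if s < 1 or not periods:
        raise ValueError("need s >= 1 and at least one process")
    # Each process announces at its (s - 1)-th step that it has taken s - 1 steps.
    announce = [(s - 1) * c for c in periods]
    result = []
    for i, c in enumerate(periods):
        arrivals = [announce[j] + d2 for j in range(len(periods)) if j != i]
        last = max(arrivals, default=0)
        hears_all_at = -(-last // c)        # ceiling division: first step at or after `last`
        final_step = max(s, hears_all_at)   # at least one port step beyond the (s - 1)-th
        result.append(final_step * c)
    return result


if __name__ == "__main__":
    periods, s, d2 = [1, 2, 5], 3, 4
    times = idle_times(periods, s, d2)
    print(times, "bound:", s * max(periods) + d2)
```

In this example every process is idle by time 15, within the $s \cdot c_{\max} + d_2 = 19$ bound; the sketch checks only the running time, not that s sessions occur.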

Theorem 4.1 A(p) solves the (s, n)-session problem
in time $s \cdot c_{\max} + d_2$ in the MPM and time
$s \cdot c_{\max} + O(\log_b n) \cdot c_{\max}$ in the SMM, where
$c_{\max} = \max\{c_i : p_i \in P\}$.

Theorem 4.2 No MP periodic algorithm for the
(s, n)-session problem runs in time less than
$\max\{s \cdot c_{\max}, d_2\}$.

Theorem 4.3 No SM periodic algorithm for the
(s, n)-session problem runs in time less than
$\max\{s \cdot c_{\max}, \lfloor \log_{2b-1}(2n-1) \rfloor \cdot c_{\min}\}$.

Proof: Suppose that
$s \cdot c_{\max} \ge \lfloor \log_{2b-1}(2n-1) \rfloor \cdot c_{\min}$.
Since all processes must
take at least s steps to have s sessions, $s \cdot c_{\max}$ is
obviously the lower bound.

Suppose that $s \cdot c_{\max} < \lfloor \log_{2b-1}(2n-1) \rfloor \cdot c_{\min}$.
By way of contradiction we assume that there exists
an algorithm A which solves the (s, n)-session
problem in the periodic SMM in time Z strictly less
than $\lfloor \log_{2b-1}(2n-1) \rfloor \cdot c_{\min}$. We prove that there
exists an infinite admissible computation of A that
contains fewer than s sessions.

Let $(\alpha, T)$ be the admissible timed computation
in which processes take steps in round robin order
and each process’s ith step occurs at time $i \cdot c_{\min}$.
Each consecutive group of steps for $p_1$ through $p_{|P|}$
is a round. (Round i occurs at time $i \cdot c_{\min}$ and
consists of the ith step of each process.) Since
all processes should enter idle states by time Z in
$\alpha$ and all the step time periods are equal to $c_{\min}$
in $(\alpha, T)$, there are at most $r = \lfloor Z/c_{\min} \rfloor$ rounds
required until termination in $\alpha$.

We will perturb $(\alpha, T)$ in order to get a new admissible
timed computation $(\alpha', T')$. We will prove
that there exists at least one port process in $(\alpha', T')$
which enters an idle state before another port process
takes any step, resulting in an admissible computation
that contains fewer than s sessions.

Fix any port process p' and change the step time
period of p' to be $\lfloor \log_{2b-1}(2n-1) \rfloor \cdot c_{\min}$. Run A with
this modified set of processes to get a new timed
admissible computation $(\alpha', T')$.

We define a subround to be a minimal computation
fragment of $\alpha'$ that involves all processes except
p'. A variable v is contaminated in subround
k of $\alpha'$ if there exists j < k and process $p \ne p'$ such
that v’s value in the global state of $\alpha'$ following p’s
step in subround j is not equal to v’s value in the
global state of $\alpha$ following p’s step in round j. We
define no variable to be contaminated in subround
0. A process p is contaminated in subround k of
$\alpha'$ if $p \ne p'$ and there exists j < k such that in
subround j of $\alpha'$, p accesses a variable that is contaminated
in subround j. We define no processes
to be contaminated in subround 0.

Let P(t) be the set of processes that are contaminated
in subround t, and let V(t) be the set of variables
that are not contaminated in subround t − 1
but are contaminated in subround t. Let $P_t$ and
$V_t$ satisfy the recurrence equations: $P_0 = V_0 = 0$,
$V_t = 2 \cdot P_{t-1} + 1$, and $P_t = (b-1) \cdot V_t + P_{t-1}$.

Lemma 4.4 $|P(t)| \le P_t$ and $|V(t)| \le V_t$ for
$0 \le t \le r$, where $r = \lfloor Z/c_{\min} \rfloor$.

Proof: By induction on t. The key points are
that p' contributes at most one variable to V(t),
while each contaminated process contributes at
most two. Also, in the worst case a process becomes
contaminated as soon as possible, processes
only become contaminated due to the variables that
just became contaminated, and each variable contaminates
at most b − 1 other processes. ■

Solving the recurrence equation, we get

$$P_t = \frac{(2b-1)^t - 1}{2}.$$

Thus the total number of processes that are contaminated
in subround $\tau$ is at most $n - 1$.
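The recurrence and its closed form can be checked numerically. The following is a minimal sketch, not part of the original proof; the function names `contamination_bounds`, `closed_form`, and `tau` are hypothetical, and `b` and `n` are the access bound and the number of processes as used above.

```python
def contamination_bounds(b, rounds):
    """Iterate P_0 = V_0 = 0, V_t = 2*P_{t-1} + 1, P_t = (b-1)*V_t + P_{t-1}."""
    P, V = [0], [0]
    for t in range(1, rounds + 1):
        V.append(2 * P[t - 1] + 1)
        P.append((b - 1) * V[t] + P[t - 1])
    return P, V


def closed_form(b, t):
    """Closed form P_t = ((2b-1)^t - 1) / 2; (2b-1)^t is odd, so the division is exact."""
    return ((2 * b - 1) ** t - 1) // 2


def tau(b, n):
    """Largest integer t with (2b-1)^t <= 2n-1, i.e. floor(log_{2b-1}(2n-1)); assumes b >= 2."""
    t = 0
    while (2 * b - 1) ** (t + 1) <= 2 * n - 1:
        t += 1
    return t


P, V = contamination_bounds(3, 3)
print(P, V)
```

With this choice of $\tau$, $(2b-1)^\tau \le 2n - 1$, so the closed form gives $P_\tau \le n - 1$, which is the bound used in the next step of the argument.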

Since fewer than $n$ processes are contaminated in
subround $\tau$, at least one port process $p \ne p'$ is in the
same state at the end of subround $\tau$ in $\alpha'$ as it is at
the end of round $\tau$ in $\alpha$, namely an idle state. But $p'$ has
not taken a step yet. Thus $(\alpha', T')$ is an admissible
timed computation that contains fewer than $s$ sessions.
Contradiction. (Note that $\log_{2b-1}(2n-1)$
approaches $\log_b n$ as $b$ and $n$ increase.) ■

5 Semi-Synchronous Model

In  this  section,  we  address  the  upper  and  lower 
bounds  in  the  semi-synchronous  shared  memory 
model.  The  semi-synchronous  algorithm  in  [4]  can 
be  adapted  to  work  in  the  shared  memory  semi- 
synchronous  model  simply  by  replacing  the  commu¬ 
nication  primitives  (send  and  receive)  with  the  ex¬ 
plicit  propagation  of  information  through  the  tree 
network  of  shared  variables  using  the  broadcast  sub¬ 
routine  described  in  Section  2. 

The  proof  of  the  lower  bound  for  the  semi- 
synchronous  SMM  is  rather  complicated,  because 
the  propagation  of  information  relies  on  reading 
and  writing  shared  variables,  and  also  because  com¬ 
putations  constructed  in  the  proof  must  satisfy  the 
real  time  constraints. 

Theorem 5.1 There is no semi-synchronous algorithm
which solves the $(s,n)$-session problem
in the SMM within time strictly less than
$\min\{\lfloor c_2/(2c_1) \rfloor, \lfloor \log_b n \rfloor\} \cdot c_2 \cdot (s - 1)$.

Proof: Let $B = \min\{\lfloor c_2/(2c_1) \rfloor, \lfloor \log_b n \rfloor\}$.

If $c_2 < 2c_1$, then $B < 1$ and it is obvious that the
bound holds since every process must take at least
$s$ steps to have $s$ sessions.


Suppose $c_2 \ge 2c_1$. Assume, by way of contradiction,
that there exists a semi-synchronous algorithm,
$A$, which solves the problem in the SMM within
time $Z$ strictly less than $B \cdot c_2 \cdot (s - 1)$. Then
$\lceil Z/(B \cdot c_2) \rceil \le (s - 1)$.

Let $(\alpha, T)$ be the admissible timed computation
in which processes take steps in round robin order
and each process' $i$th step occurs at time $i \cdot c_2$. Each
consecutive group of steps for $p_1$ through $p_{|P|}$ is a
round.

There are $t = \lceil Z/c_2 \rceil$ rounds required until termination
in $\alpha$. Let $\alpha = \beta\gamma$, where $\beta$ contains the
first $t$ rounds of $\alpha$.

Following the proof of Theorem 1 in [2], we will
show that there is a reordering $\beta'$ of $\beta$ that results
in the same global state as $\beta$ but that contains at
most $s - 1$ sessions. Thus $\alpha' = \beta'\gamma$ is a computation
with at most $s - 1$ sessions. We then will show
how to time the events in $\alpha'$ to produce an admissible
timed computation $(\alpha', T')$ with at most $s - 1$
sessions, a contradiction.

We construct a partial order $\le_\beta$ on the steps in
$\beta$, representing dependency. Let $\sigma \le_\beta \tau$ for every
pair of steps $\sigma$ and $\tau$ in $\beta$, and say $\tau$ is dependent
on $\sigma$, if $\sigma = \tau$ or if $\sigma$ precedes $\tau$ in $\alpha$ and $\sigma$ and $\tau$
either involve the same process or involve the same
variable. Close $\le_\beta$ under transitivity. The following
claim is not difficult to prove.

Claim 5.2 $\le_\beta$ is a partial order, and every total
order of steps of $\beta$ consistent with $\le_\beta$ is a computation
which leaves the system in the same global
state as $\beta$ does.
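The dependency relation can be computed mechanically from a sequence of steps. The sketch below is illustrative only: it assumes each step is represented as a hypothetical `(process, variable)` pair listed in the order of the computation, and the function name `dependency_order` is not from the paper.

```python
def dependency_order(steps):
    """Return all index pairs (a, b) such that step b is dependent on step a.

    steps: list of (process, variable) pairs in computation order.
    Two steps are directly related if they are the same step, or if the
    earlier one shares a process or a variable with the later one.
    The relation is then closed under transitivity.
    """
    n = len(steps)
    reach = [[False] * n for _ in range(n)]
    for a in range(n):
        reach[a][a] = True
        for b in range(a + 1, n):
            same_process = steps[a][0] == steps[b][0]
            same_variable = steps[a][1] == steps[b][1]
            if same_process or same_variable:
                reach[a][b] = True
    # Transitive closure (Floyd-Warshall style).
    for k in range(n):
        for a in range(n):
            if reach[a][k]:
                for b in range(n):
                    if reach[k][b]:
                        reach[a][b] = True
    return {(a, b) for a in range(n) for b in range(n) if reach[a][b]}


steps = [("p1", "x"), ("p2", "x"), ("p2", "y"), ("p3", "y")]
print(sorted(dependency_order(steps)))
```

Steps that are unrelated in the result may be reordered freely, which is the freedom the reordering argument relies on.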

Let $\beta = \beta_1 \beta_2 \cdots \beta_m$ where $m = \lceil Z/(B \cdot c_2) \rceil$.
Each $\beta_k$ (except possibly the last one) consists of
$B$ rounds. Let $y_0$ be an arbitrary port in $Y$. For
all $k$, $1 \le k \le s - 1$, we show that there exists
a port $y_k$ and two sequences of steps $\phi_k$ and $\psi_k$
such that the following properties hold. ($p_{y_i}$ is the
port process corresponding to $y_i$, $1 \le i \le s - 1$.)

(i) $\phi_k\psi_k$ is a total ordering of the steps in $\beta_k$,
consistent with $\le_\beta$.

(ii) $\phi_k$ does not contain any step by process $p_{y_{k-1}}$
which accesses $y_{k-1}$.

(iii) $\psi_k$ does not contain any step by process $p_{y_k}$
which accesses $y_k$.

Then define $\beta'$ to be $\phi_1\psi_1 \cdots \phi_m\psi_m$.

Proof: We need to prove that all steps in $(\beta', T')$
satisfy the real time constraint imposed on the
semi-synchronous model.

By the construction, no two consecutive steps
by a process in the system are closer than $c_1$ in
$(\beta', T')$; therefore, the lower bound on step time is
preserved.

We now show that the maximum time between
any two consecutive steps of a process is $c_2$. Let
$\pi$ and $\pi'$ be two consecutive steps of some process
$p$. First assume that $\pi$ and $\pi'$ both occur in $\beta_k$ for
some $k$; $\pi$ is the $i$-th step of $p$ in $\beta_k$ and $\pi'$ is the
$(i+1)$-st. If $\pi$ and $\pi'$ are both in $\phi_k$ or are both in
$\psi_k$, then either there is no change in their times or
they are retimed to be $c_1$ apart.

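Now suppose $\pi$ is in $\phi_k$ and $\pi'$ is in $\psi_k$. By
construction,

$$T'(\pi') - T'(\pi) \le B \cdot c_1 + c_1 = \min\{\lfloor c_2/(2c_1) \rfloor, \lfloor \log_b n \rfloor\} \cdot c_1 + c_1 \le \frac{c_2}{2} + c_1.$$

Since $c_1 \le c_2/2$, this difference is at most $c_2$.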


For all $k$, $1 \le k \le m$, $y_k$, $\phi_k$ and $\psi_k$ are defined
inductively. If there is some port variable that is
not accessed by any step in $\beta_k$, then let $y_k$ be that
port, $\phi_k$ the empty sequence, and $\psi_k = \beta_k$. Otherwise
(every port variable is accessed in $\beta_k$), let
$\tau_k$ be the first step in $\beta_k$ that accesses $y_{k-1}$. As a
consequence of a very general result proved in [1],
there exists a port variable $y_k$ such that:

(iv) it is not the case that $\tau_k \le_\beta \sigma_k$, where $\sigma_k$ is
the last step in $\beta_k$ that accesses $y_k$.

We now assign times (the mapping $T'$) to every
step in $\beta_k$ and then let $\beta'_k$ be any total ordering of
the steps in $\beta_k$ consistent with the times. We then
define $\phi_k$ and $\psi_k$.

• For each process $p \in P$, if there are some steps
of $p$ in $\beta_k$ which $\sigma_k$ is dependent on, we let $\pi$ be
the step that occurred last among them. We
retime $\pi$ and all the steps of $p$ that happened
earlier than $\pi$ such that the first step of $p$ in
$\beta_k$ occurs at $2c_1B(k - 1) + c_1$, the next step
occurs $c_1$ time later, and so on.

• For each process $p$, if there are some steps of $p$
in $\beta_k$ which are dependent on $\tau_k$, we let $\pi$ be
the step that occurred first among them. We
retime $\pi$ and all the steps of $p$ that occurred
later than $\pi$ such that the last step of $p$ in $\beta_k$
occurs at $2c_1Bk$, the step before that occurs
$c_1$ time earlier, and so on.

• All other steps in $\beta_k$ are assigned the same
time as they are under $T$ (the original timing).

Let $\phi_k$ be all the steps that happened up to
time($\sigma_k$), including $\sigma_k$, and let $\psi_k$ be the remainder.

Lemma 5.3 $\beta'$ is consistent with $\le_\beta$.

Proof: For any $k$, $1 \le k \le s - 1$, pick any two
steps $\pi$ and $\pi'$ in $\beta_k$ such that $\pi \le_\beta \pi'$. Thus
$T(\pi) \le T(\pi')$. (Recall that $T$ is the original timing.)
We only need to prove that $T'(\pi) \le T'(\pi')$,
where $T'$ is the new timing.

Each of $\pi$ and $\pi'$ belongs to either $\phi_k$ or $\psi_k$ and
is either retimed or not. If $\pi$ is retimed in $\phi_k$ and
$\pi'$ is not retimed in $\psi_k$, then $T'(\pi) \le T'(\pi')$ since
$T'(\pi) \le T(\pi)$ and $T'(\pi') \ge T(\pi')$. All other cases
can be proved similarly. ■

Lemma 5.4 $(\beta', T')$ is admissible.


Now suppose $\pi$ occurs in $\beta_{k-1}$ and $\pi'$ occurs in
$\beta_k$. In the worst case, $\pi$ is retimed to occur at
$x - c_2/2$ and $\pi'$ is retimed to occur at $x + c_2/2$,
where $x$ is the time at the end of $\beta_{k-1}$. (This is
true by the definition of $B$.) So the time between
$\pi$ and $\pi'$ is at most $c_2$. ■


Lemma 5.5 $\beta'$ contains fewer than $s$ sessions.

Lemma 5.5 is true because of the way $\psi_{k-1}$ and $\phi_k$
are defined. The theorem now follows. ■


6 The Sporadic Model

In the MPM, a lower bound $c_1$ on step time and
lower and upper bounds $[d_1, d_2]$ on message delay
are imposed. The correctness of our sporadic algorithm
A(sp) depends on the following observation:
If a process $p_i$ receives a message $m$ from a process
$p_j$ at time $t$, then the message must have been sent
no later than $t - d_1$, because it takes at least $d_1$ time
for a message to be delivered. All the messages received
by $p_i$ after $t + d_2 - d_1$ must have been sent
after $m$ was, because it takes at most $d_2$ time for a
message to be delivered.
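The observation amounts to two chains of inequalities. In the worked form below, $t_m$ (the send time of $m$), $t'$ (the receive time of a later message) and $t'_s$ (its send time) are names introduced here for illustration only.

```latex
\begin{align*}
  t - d_2 \;\le\; t_m \;&\le\; t - d_1, \\
  t' > t + (d_2 - d_1) \;&\Longrightarrow\; t'_s \;\ge\; t' - d_2 \;>\; t - d_1 \;\ge\; t_m .
\end{align*}
```

The first line restates the delay bounds for $m$; the second shows that any message received more than $d_2 - d_1$ after $m$ was sent strictly later than $m$.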

Using the above fact, each process broadcasts a
message at every step carrying its knowledge of the
number of sessions that have happened by the time the
step occurs. After receiving a message $m$ which
says there are at least $k - 1$ sessions in the system,
a process waits for $d_2 - d_1$ time. After that, the
process waits to receive at least one message from
every process. It is then clear that there are at least $k$
sessions in the system by that time, because every
message received after $t + d_2 - d_1$, where $t$ is the
time that $m$ is received, must have been sent after the
time there are at least $k - 1$ sessions.

We first proceed by presenting the algorithm
A(sp). A message is denoted $m(i, V)$, where $i$ is
the identifier of a sending process $p_i$ and $V$ is an
integer in $[0, s - 1]$. We let $*$ be a don't-care value
for either of the fields and $u = d_2 - d_1$.

A(sp) for process p_i:

B := ⌊u/c_1⌋ + 1;
count := session := 0;
msg_buf := temp_buf := ∅;
while (session < s − 1)
    read buf_i and let the set of messages obtained be M;
    msg_buf := msg_buf ∪ M;
    if for all j ∈ [n], m(j, session) is in msg_buf
    then /* condition 1 */
        count := 0;
        session := session + 1;
    elsif (count > B)
    then
        temp_buf := temp_buf ∪ M;
        if for all j ∈ [n], at least one m(j, *) is in temp_buf
        then /* condition 2 */
            count := 0;
            session := session + 1;
            temp_buf := ∅;
        end if;
    end if;
    broadcast m(i, session);
    count := count + 1;
end while;

Enter an idle state.
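One iteration of the loop can be sketched as a method that takes the messages read in that step and returns the message to broadcast. This is an illustrative sketch only: the class name `SporadicSessionProcess`, the representation of $m(j, V)$ as a tuple `(j, V)`, the 0-based process identifiers, and the assumption that a broadcast is also delivered to the sender are all choices made here, not taken from the paper.

```python
from math import floor


class SporadicSessionProcess:
    """Sketch of A(sp) for one process; a message m(j, V) is the tuple (j, V)."""

    def __init__(self, i, n, s, c1, d1, d2):
        self.i, self.n, self.s = i, n, s
        self.B = floor((d2 - d1) / c1) + 1  # B := floor(u / c1) + 1, with u = d2 - d1
        self.count = 0
        self.session = 0
        self.msg_buf = set()
        self.temp_buf = set()

    def idle(self):
        return self.session >= self.s - 1

    def step(self, received):
        """Run one loop iteration; return the broadcast message, or None when idle."""
        if self.idle():
            return None
        M = set(received)
        self.msg_buf |= M
        if all((j, self.session) in self.msg_buf for j in range(self.n)):
            # condition 1: every process has reported the current session value
            self.count = 0
            self.session += 1
        elif self.count > self.B:
            self.temp_buf |= M
            senders = {j for (j, _) in self.temp_buf}
            if all(j in senders for j in range(self.n)):
                # condition 2: a message from everyone arrived after the waiting period
                self.count = 0
                self.session += 1
                self.temp_buf = set()
        self.count += 1
        return (self.i, self.session)


p = SporadicSessionProcess(i=0, n=2, s=3, c1=1.0, d1=1.0, d2=2.0)
print(p.step([]), p.step([(0, 0), (1, 0)]), p.step([(0, 1), (1, 1)]), p.idle())
```

A driver that schedules steps and delivers messages within $[d_1, d_2]$ would be needed to simulate a full computation; that part is omitted here.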

Theorem 6.1 A(sp) solves the $(s, n)$-session problem
within time

$$\min\{\lfloor u/c_1 + 1 \rfloor \gamma + (u + 2\gamma),\ d_2 + \gamma\}(s - 2) + d_2 + 2\gamma.$$


Proof: Consider an arbitrary admissible timed
computation $C$ of A(sp). The following lemma
(proof omitted) can be used to prove the theorem.

Lemma 6.2 There exists at least one step in $C$ in
which a process sets its session to $k$, $0 < k \le s - 1$.

Let $p_{i_k}$ be the first process which sets $session_{i_k}$
to $k > 0$. To increment session, a process must
receive one or more messages which notify the process that
there is at least one session after the last update
to session. Let $M_k$ be the set of messages received
by $p_{i_k}$ that causes $p_{i_k}$ to set $session_{i_k}$ to
$k$. (We define $M_0$ to be the empty set.) In more
detail: If condition 1 was true, $M_k$ is the set of
messages $m(j, session_{i_{k-1}})$ for all integers $j \in [n]$ in
msg_buf. If condition 2 was true, $M_k$ is the set of
messages in temp_buf at the time. Assuming that
$m_k$ is the message which is sent last among $M_k$, we
prove the following lemma.

Lemma 6.3 Let $\pi$ be the step which sends $m_k$.
There are at least $k$ sessions by the time $\pi$ occurs
in $C$.

Proof: We proceed by induction on $k$.

For the basis, when $k = 0$, it is always true that
there are at least 0 sessions in $C$.

Inductively, when $k > 0$, assuming the lemma is
true for $k - 1$, we show that when $\pi$ occurs, there
are at least $k$ sessions.

Let $\tau$ be the step that sent $m_{k-1}$ and $\sigma$ be the
step in which $p_{i_{k-1}}$ sets $session_{i_{k-1}}$ to $k - 1$. For
$p_{i_k}$ to update its session, one of conditions 1 and
2 in the algorithm must hold.

First, assume that condition 1 is true. According
to the algorithm, a process broadcasts a message
with a new session value $k - 1$ after it sets its
session to the new value $k - 1$, before which time
there were $k - 1$ sessions in $C$ because the induction
hypothesis dictates that there were $k - 1$ sessions in
$C$ when $m_{k-1}$ was sent. Because $\sigma$ is the first step
to set session to $k - 1$, all messages in $M_k$ must
have been sent when or after $\sigma$ occurs. Because all
processes take at least one step to send messages in
$M_k$ after there were at least $k - 1$ sessions, there
must be at least $k$ sessions in $C$ by the time that
message $m_k$ is sent.

Second, assume that condition 2 is true. Since
$p_{i_k}$ is the first process which sets $session_{i_k}$ to $k$,
$session_{i_k}$ must have taken on $k - 1$ according to
the proof of Lemma 6.2. Let $t$ be the time when $\tau$
occurs, at which time $m_{k-1}$ was sent, and $t'$ be the
time that $p_{i_k}$ sets $session_{i_k}$ to $k - 1$. The message
$m_{k-1}$ must arrive at a time in $[t + d_1, t + d_2]$
because of the bounds on message delay.
Thus, $t' - t \ge d_1$. Since count in the algorithm
is reset whenever session is updated, when $count_{i_k}$
is equal to $B$ most recently before $p_{i_k}$ sets
$session_{i_k}$ to $k$, say, at time $t''$, at least time $B \cdot c_1 > u$
must have elapsed since $t'$. So, $t'' > t' + u \ge t + d_2$.
Therefore, all messages received at $t''$ or
later must be sent after time $t$, at which time there
were $k - 1$ sessions by the assumption. Since at
least one message is sent by each process after time
$t$, there must be at least one additional step by all
processes between time $t$ and the time $\pi$ occurs.
Therefore, there must be at least $k$ sessions by the
time $\pi$ occurs. ■

To analyze the time complexity of the algorithm
A(sp), we use the actual maximum step time $\gamma$,
since in our sporadic model the upper bound on
the step time is not available.

We define for each $k$, $2 \le k \le s - 1$, $T_k = \max\{t :
p_i$ sets $session_i$ to $k$ at time $t$ in $C$, for all $p_i \in R\}$.

Lemma 6.4 For each $k$, $2 \le k < s - 1$,

$$T_{k+1} \le T_k + \min\{\lfloor u/c_1 + 1 \rfloor \gamma + (u + 2\gamma),\ d_2 + \gamma\}.$$

Proof: According to the algorithm, a process
broadcasts a message at every step. Thus, if process
$p_i$ receives a message from process $p_j$ at time
$t$, it will receive at least one more message from $p_j$
by time $t + u + 2\gamma$. Let $p_{i_k}$ be the last process
to set $session_{i_k}$ to $k$ and $p_{i_{k+1}}$ be the last process
to set $session_{i_{k+1}}$ to $k + 1$. We now look at each of
the possible cases which may cause $session_{i_{k+1}}$ to
be updated to $k + 1$:

If condition 2 is true when $session_{i_{k+1}}$ is updated
to $k + 1$, $p_{i_{k+1}}$ has made at least $B = \lfloor u/c_1 + 1 \rfloor$
steps since the last update to $session_{i_{k+1}}$. Because
a process must wait, since then, at most $u + 2\gamma$
time to receive another set of messages sent from
all processes, at most $\lfloor u/c_1 + 1 \rfloor \gamma + (u + 2\gamma)$ time
has elapsed.

If condition 1 is true when $session_{i_{k+1}}$ is updated
to $k + 1$, let $t$ be the time at which $p_{i_k}$ broadcasts
$m(i_k, k)$; note that by definition, $t = T_k$. Message
$m(i_k, k)$ must be received by $p_{i_{k+1}}$ by time $t + d_2 + \gamma$.


Since one of the two conditions becomes true within at most
$\min\{\lfloor u/c_1 + 1 \rfloor \gamma + (u + 2\gamma),\ d_2 + \gamma\}$ time after the
last update to session, the lemma follows. ■

From Lemmas 6.2 and 6.3, it follows that there
are at least $s - 1$ sessions at the time that $m_{s-1}$
is sent. All processes will eventually set their
session to $s - 1$ (since session cannot be bigger
than $s - 1$). Each process sets session to $s - 1$
because it receives a certain message. Therefore,
there is at least one additional step by all processes
after there have been $s - 1$ sessions in $C$. Thus,
there are at least $s$ sessions in $C$.

By the algorithm, initially it takes at most $d_2 + 2\gamma$
to receive at least one message from all processes
in order to accomplish the first session. Using
Lemma 6.4, it is clear now that it takes at most
$\min\{\lfloor u/c_1 + 1 \rfloor \gamma + (u + 2\gamma),\ d_2 + \gamma\}(s - 2) + d_2 + 2\gamma$.
(This equals $\min\{(\lfloor u/c_1 \rfloor + 3) \cdot \gamma + u,\ d_2 + \gamma\}(s - 1) + \gamma$
if $d_1 \le \lfloor u/c_1 + 1 \rfloor \cdot \gamma$.) ■
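The upper bound can be evaluated directly for given parameters. The helper below is a sketch under the reading of the bound given above; the function name `sporadic_upper_bound` and the sample values are hypothetical.

```python
from math import floor


def sporadic_upper_bound(s, c1, d1, d2, gamma):
    """min{floor(u/c1 + 1)*gamma + (u + 2*gamma), d2 + gamma}*(s - 2) + d2 + 2*gamma."""
    u = d2 - d1
    B = floor(u / c1) + 1
    per_session = min(B * gamma + (u + 2 * gamma), d2 + gamma)
    return per_session * (s - 2) + d2 + 2 * gamma


# Sample values chosen so that d2 + gamma is the smaller term of the minimum.
s, c1, d1, d2, gamma = 4, 1, 1, 3, 2
bound = sporadic_upper_bound(s, c1, d1, d2, gamma)
print(bound, (d2 + gamma) * (s - 1) + gamma)
```

When $d_2 + \gamma$ attains the minimum, the bound simplifies to $(d_2 + \gamma)(s - 1) + \gamma$, which is the identity noted in the parenthetical remark.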

We  now  prove  the  lower  bound. 

Theorem  6.5  No  sporadic  algorithm  solves  the 
(s,n)-session  problem  in  the  MPM  within  time  < 
max{[5^J  •  A',ci}(s-  1)  where  K  = 

Proof:  The  general  structure  of  this  proof  follows 
that  of  Theorem  5.1. 

Let  B  =  L^J. 

When $B \cdot K < c_1$, the lower bound holds because
a process must execute at least $s$ steps to achieve $s$
sessions.

Suppose that $B \cdot K \ge c_1$. Assume, by way of contradiction,
that there exists a sporadic algorithm,
$A$, which solves the $(s, n)$-session problem in the
MPM within time $Z$ strictly less than $B \cdot K \cdot (s - 1)$.
Then $\lceil Z/(B \cdot K) \rceil \le (s - 1)$. We show that there
exists an admissible timed computation of $A$ which
does not include $s$ sessions.

Let $(\alpha, T)$ be the admissible timed computation
in which regular processes take steps in round robin
order and each process' $i$th step occurs at time $i \cdot K$,
and all message delays are exactly $d_2$. Each
consecutive group of steps for $p_1$ through $p_n$ is a
round.

Therefore, there are $r = \lceil Z/K \rceil$ rounds required
until termination in $\alpha$. Let $\alpha = \beta\gamma$, where $\beta$ contains
the first $r$ rounds of $\alpha$.
We will show that there is a reordering $\beta'$ of $\beta$
that contains at most $s - 1$ sessions. Thus $\alpha' = \beta'\gamma$
is a sequence with at most $s - 1$ sessions. In order to
get an admissible computation $\beta'$, we will assign
new times ($T'$) to every step in $\beta$ and let $\beta'$ be any
total ordering of the steps in $\beta$ consistent with the
times, and then we will prove that $(\beta', T')$ is an
admissible timed computation which results in the
same global state as $\beta$. A contradiction.

Then we will show how to reorder $\beta$ to produce an
admissible timed computation $(\alpha', T')$ that results
in the same global state as $\alpha$.

Let $\beta = \beta_1 \cdots \beta_m$ where $m = \lceil Z/(B \cdot K) \rceil$. Each
$\beta_k$, $1 \le k \le m$ (except possibly the last one) consists
of $B$ rounds, and for some sequence $i_0, i_1, \ldots, i_m$
of integers in $[1, n]$, each computation fragment $\beta_k$
consists of $\phi_k\psi_k$ such that:

(i) $\phi_k$ does not contain any step by process
$p_{i_{k-1}}$.

(ii) $\psi_k$ does not contain any step by process $p_{i_k}$.

Then define $\beta'$ to be $\phi_1\psi_1 \cdots \phi_m\psi_m$. As in
Lemma 5.5, we prove that $\beta'$ has at most $s - 1$
sessions.

Lemma 6.6 $\beta'$ has at most $s - 1$ sessions.

Since all processes in $\gamma$ are in idle states, $\alpha'$ has
at most $s - 1$ sessions.

We need to show how to reorder every step in
$\beta$ to get an admissible timed computation $(\alpha', T')$
which preserves properties (i) and (ii), and results
in the same global state as $\alpha$.

Let us first assign times (a new mapping $T''$) to
every step $\pi$ in $\beta$, including all the steps of the
network $N$, such that $T''(\pi) = T(\pi) \cdot (2c_1/K)$. That
is, every process except $N$ takes a step every
$2c_1$ and each round occurs every $2c_1$. Since the
delivery steps of $N$ are also remapped, the message
delay is reduced to $d_2 \cdot (2c_1/K)$.

$C = (\beta, T'')$ is an admissible timed computation
because $\beta$ is a computation, and the step times and
message delays obey the sporadic time constraint.

From $C$, we construct $(\beta', T')$, an admissible
computation which results in the same global state
as $\beta$. We map $T''$ to $T'$ in order to get $i_k$,
$i_{k-1}$, $\phi_k$ and $\psi_k$ for all $k$, $1 \le k \le m$.

For all $k$, choose $i_k$ arbitrarily, as long as $i_k \ne
i_{k-1}$. For all $0 \le j \le m$, let $t_j = B \cdot 2c_1 \cdot j$. $t_j$ is
equal to the ending time of $\beta_j$ in $C$.


1. Let $\pi$ be all steps of $p_{i_k}$ and all the steps of $N$
that deliver messages to $p_{i_k}$ in $\beta_k$. Retime $\pi$
such that $T'(\pi) = t_{k-1} + (T''(\pi) - t_{k-1})/2$.

2. Let $\sigma$ be all steps of $p_{i_{k-1}}$ and all the steps of
$N$ that deliver messages to $p_{i_{k-1}}$ in $\beta_k$. Retime
$\sigma$ such that $T'(\sigma) = t_k - (t_k - T''(\sigma))/2$.

3. All other steps in $\beta_k$ are assigned the same
time as they are under $T''$.
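The three retiming rules can be illustrated on one block $\beta_k$. This is a simplified sketch: it ignores the delivery steps of the network $N$, represents each step as a hypothetical `(process, time)` pair, and uses the made-up function name `retime_block`.

```python
def retime_block(k, B, c1, steps, early, late):
    """Apply rules 1-3 to the steps of block k.

    steps: list of (process, T''-time) pairs lying in block k.
    early: the process p_{i_k}, whose steps are compressed toward t_{k-1} (rule 1).
    late:  the process p_{i_{k-1}}, whose steps are compressed toward t_k (rule 2).
    """
    t_prev = B * 2 * c1 * (k - 1)  # t_{k-1}
    t_k = B * 2 * c1 * k           # t_k
    retimed = []
    for proc, t in steps:
        if proc == early:
            retimed.append((proc, t_prev + (t - t_prev) / 2))
        elif proc == late:
            retimed.append((proc, t_k - (t_k - t) / 2))
        else:
            retimed.append((proc, t))  # rule 3: time unchanged
    return retimed


# Block 1 with B = 4 and c1 = 1: every process steps at times 2, 4, 6, 8 under T''.
block = [(p, t) for t in (2, 4, 6, 8) for p in ("a", "b", "c")]
result = retime_block(1, 4, 1, block, early="a", late="b")
print(sorted(t for p, t in result if p == "a"), sorted(t for p, t in result if p == "b"))
```

In the example, the steps of the early process all land in the first half of the block and those of the late process in the second half, while consecutive retimed steps remain $c_1$ apart.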

Fix up the states of the network in $\beta'$ so that the
state of the network is consistent with all the send
and receive steps of regular processes in $\beta$. Let $\phi_k$
be the prefix of $\beta'_k$ up to the last step of $p_{i_k}$, and let
$\psi_k$ be the remainder. Let $\beta'$ be any total ordering
consistent with $T'$.

We now prove the following lemma:

Lemma 6.7 $(\beta', T')$ is an admissible timed computation
which results in the same global state as $\beta$.

Proof: The time period of $\beta_k$ in $C'$, $1 \le k \le m$,
is equal to $B \cdot 2c_1$ since $\beta_k$ in $C$ consists of $B$
rounds (except the last one) and the step time is $c_1$.
Since no step is retimed outside the time boundary
of $(\beta_k, T'')$, the time period of $(\beta'_k, T')$ is also equal
to that of $(\beta_k, T'')$.

In $(\beta, T'')$, the message delay of all messages is
bigger than $B \cdot 2c_1$ by the definition of $B$, so that
the messages sent in $\beta_k$ are never received in $\beta_k$. In
$(\beta', T')$, the messages sent in $\beta'_k$ are never received
in $\beta'_k$ either, because no step is retimed outside the
time boundary of $(\beta_k, T'')$.

By the construction, in $\beta'$, the delivery steps of
all messages are retimed with the steps that receive
the messages, so that every message is always delivered
before it is received. Every step in $\beta'$ receives the
same set of messages as the corresponding step in $\beta$
does. Since states of processes are updated based
only on the current state and the set of messages
received, $\beta'$ is a computation which leads to the
same global state as $\beta$.

Now, to prove that $(\beta', T')$ is admissible, we need
to show that $(\beta', T')$ obeys the sporadic time constraints.

First, it is clear that every computation step time
in $\beta'$ is bigger than the minimum step time $c_1$ by
the construction.
Second, we need to prove the delay of any message
sent in $\beta'$ is within $[d_2 - u, d_2]$. For any message
$m$ sent in $\beta'$, let $\pi_i$ be the step of a process in $R$
which sends $m$ and $\pi_j$ be the step of $N$ which delivers
$m$. We need to prove that $T'(\pi_j) - T'(\pi_i)$ is
in $[d_2 - u, d_2]$. Without loss of generality, assuming
$\pi_i$ is in $\beta'_k$,

$$T'(\pi_j) - T'(\pi_i) = T''(\pi_j) - T''(\pi_i) + [T'(\pi_j) - T''(\pi_j)] - [T'(\pi_i) - T''(\pi_i)].$$

It can be proved that for any step $\pi$,

$$-u/4 \le T'(\pi) - T''(\pi) \le u/4.$$

-  T"(iri)  =  B  ^ci  <  da  -  u/2  from 
the construction of $C'$. Therefore, it is clear that
$T'(\pi_j) - T'(\pi_i)$ is always within $[d_2 - u, d_2]$ in $\beta'$.

■

Since there exists an admissible timed computation
$(\beta', T')$ of $A$ which has at most $s - 1$ sessions
by Lemma 6.7 and Lemma 6.6, this contradicts the
assumed existence of algorithm $A$. Therefore, Theorem
6.5 now follows. ■
Abstract

The problem of fault-tolerant coordination is fundamental
in distributed computing. In the past,
researchers  have  considered  two  types  of  coordina¬ 
tion:  general  coordination,  in  which  the  actions  of 
faulty  processors  are  irrelevant,  and  consistent  co¬ 
ordination,  in  which  the  faulty  processors  are  for¬ 
bidden  from  acting  inconsistently.  This  paper  stud¬ 
ies  the  possibility  and  complexity  of  achieving  co¬ 
ordination  in  synchronous  and  asynchronous  sys¬ 
tems  with  crash,  send-omission,  and  general  omis¬ 
sion  failures.  We  indicate  the  systems  in  which  co¬ 
ordination  cannot  be  achieved  and,  when  it  can, 
analyze  the  computational  complexity  of  optimally 
achieving  it.  In  some  cases,  optimum  solutions  can 
be  implemented  in  polynomial  time,  while  in  others 
they  require  NP-hard  local  computation.  These  re¬ 
sults  provide  a  thorough  characterization  of  coordi¬ 
nation  and  will  thus  aid  researchers  in  determining 
the  approach  to  take  when  attempting  to  achieve 
fault-tolerant  coordination. 

1  Introduction 

Coordinating  the  activity  of  the  processors  in  a  dis¬ 
tributed  system  is  a  fundamental  problem  in  dis¬ 
tributed  computing.  In  such  problems,  processors 
are required to agree on a common action to perform
and to ensure that the action chosen is valid
given the context within which they are operating.
This paper specifically considers fault-tolerant coordination
and assumes that some (but not all) of
the  processors  in  a  system  may  be  faulty.  Fault- 
tolerant  coordination  requires  that  the  nonfaulty 
processors  successfully  coordinate  their  actions  de¬ 
spite  the  failures  of  others.  There  is  a  large  body  of 
literature  within  computer  science  that  has  studied 
fault-tolerant  solutions  to  coordination  problems, 
such  as  Reliable  Broadcast  and  Distributed  Con¬ 
sensus  (Fischer  [3]  provides  a  survey  of  many  such 
problems). 

This  paper  considers  a  broad  range  of  coordina¬ 
tion  problems  and  divides  them  into  two  classes. 
The  first  is  the  class  of  the  general  coordination 
problems.  These  require  agreement  only  among 
the  correct  processors.  Most  coordination  problems 
considered  in  the  literature  are  in  this  class.  The 
second class is that of consistent coordination problems.
When a faulty processor performs an action,
these  problems  constrain  it  to  perform  one  that  is 
consistent  with  those  performed  by  the  correct  pro¬ 
cessors. 

This  paper  explores  solutions  to  such  problems 
within  several  types  of  distributed  systems.  We 
consider  systems  with  both  synchronous  and  asyn¬ 
chronous  message  passing.  We  also  consider  sys¬ 
tems  with  crash  (stopping),  send-omission,  and 
general  (send-receive)  omission  failures.  For  each 
system,  our  goal  is  to  determine  the  cases  in  which 
such  problems  can  be  solved  and,  in  those  cases,  to 
find  the  best  possible  solutions.  These  are  solutions 
that  are  optimum  in  the  number  of  rounds  of  com¬ 
munication  needed.  For  such  solutions,  we  analyze 
the  time-complexity  of  the  local  computation  they 
require. 

The  results  in  this  paper  are  based  on  the  rela¬ 
tionship  between  coordination  and  different  forms 
of  processor  knowledge  [8].  It  is  well-established 
that knowledge can be used to characterize and improve
solutions to various problems in distributed
computing [1,7,9,10,11]. For example, Moses and
Tuttle [12] showed that a weak form of common
knowledge  is  necessary  for  the  solution  of  general 
simultaneous  coordination  problems  and  used  this 
fact  to  derive  optimum  solutions  to  such  problems. 
Neiger  and  Tuttle  [15]  subsequently  showed  related 
results  considering  consistent  coordination  and  a 
stronger  form  of  common  knowledge.  In  a  pre¬ 
vious  paper  [13],  we  considered  nonsimultaneous 
coordination  problems  (both  general  and  consis¬ 
tent),  characterized  their  solutions  in  terms  of  sev¬ 
eral  types  of  knowledge,  and  explored  the  space  of 
optimum  solutions. 

In  this  paper,  we  concentrate  on  eventual  com¬ 
mon  knowledge  [8,16].  We  do  so  because  earlier 
work  [13]  showed  that  eventual  common  knowledge 
is  necessary  whenever  coordination  is  achieved. 
Thus,  if  some  form  of  eventual  common  knowl¬ 
edge  is  impossible  in  a  system,  the  same  is  true 
of  the  corresponding  form  of  coordination.  If  test¬ 
ing  for  eventual  common  knowledge  is  computa¬ 
tionally  expensive,  the  same  is  true  of  optimum 
coordination  protocols.  Reasoning  about  eventual 
common knowledge can be difficult because its semantic
structure is more complex than that of simple
common knowledge; Tuttle [16] showed that
its  semantics  correspond  to  an  optimal  strategy  in 
game  theory.  However,  we  have  been  able  to  show 
that,  for  the  cases  we  consider,  eventual  common 
knowledge  is  closely  related  to  distributed  knowl¬ 
edge  [2,8],  which  is  simple  knowledge  ascribed  to 
a  group  rather  than  a  single  processor.  Reasoning 
about  distributed  knowledge  is  relatively  easy,  and 
it  is  by  using  distributed  knowledge  that  we  prove 
many  of  the  results  of  this  paper. 

The  important  contributions  of  this  paper  are  the 
following. 

•  For  some  systems  in  which  as  many  as  half 
of  the  processors  may  fail,  achieving  consis¬ 
tent  coordination  is  impossible  because  the 
necessary  form  of  eventual  common  knowledge 
cannot  be  achieved.  These  systems  are  syn¬ 
chronous  systems  with  general  (send-receive) 
omission  failures  and  asynchronous  systems 
with  send-  or  general  omission  failures.  Intu¬ 
itively,  the  impossibility  results  stem  from  the 
potential  confusion  as  to  the  identity  of  the 
correct  processors. 

• We provide a thorough analysis of the relationship
between eventual common knowledge and
distributed knowledge as relevant to the problem
of achieving fault-tolerant coordination.

• For most systems with general omission failures
in which consistent coordination is possible
(i.e., in which the correct processors are
always a majority), optimum solutions require
NP-hard local computation. This is shown
by proving the NP-hardness of testing for distributed
knowledge.

Because these results provide a thorough characterization
of coordination, they can greatly aid researchers
in determining the approach to take when
attempting to achieve fault-tolerant coordination.

2  Definitions 

This  section  defines  a  model  of  a  distributed  sys¬ 
tem.  This  model  is  similar  to  others  used  to  study 
knowledge  and  coordination  [1,8,9,12,15]. 

A distributed system consists of a finite set $P$ of
$n$ processors; they are connected by a communication
network such that any processor can send a
message to any other. All processors share a clock
that starts at time 0 and advances in increments
of one.¹ Before a computation begins, each processor
may have an initial external input. Computation
then proceeds in a sequence of rounds, with
round $r$ taking place between time $r - 1$ and time
$r$. In every round, a processor sends messages to
other processors, receives messages that have arrived
since the last round, performs some local computation
and, optionally, a coordination action. At
any given time, a processor’s state consists of the
time on the global clock, the messages it has sent
and received, its initial input, and the coordination
actions it has performed. It is assumed that
all messages are tagged with the times at which
they were sent and received and that coordination
actions are tagged with the time when they were
performed. A global state is a tuple $(s_1, \ldots, s_n)$
of local states.

Processors follow a communication protocol,
which specifies the messages a processor is required
to send for a round as a function of the processor’s
local state at the beginning of that round. A run is
a protocol paired with an infinite sequence of global
states, one per round. An ordered pair $(r, l)$, where
$r$ is a run and $l$ is a natural number, is called a point
and represents the system after the first $l$ rounds of
$r$. The local state of processor $p$ at that point is
denoted by $r_p(l)$. Some processors correctly follow the
protocol and are thus nonfaulty. Let $\mathcal{N}(r)$
represent the set of processors nonfaulty in run $r$. Other
processors may be subject to failures. We consider
three types of failures. These are crash failures (in
which a faulty processor simply stops, possibly in
the middle of a round), send-omission failures (in
which a faulty processor may intermittently omit
to send messages), and general omission failures (in
which a faulty processor may intermittently omit to
send and/or receive messages). We assume that the
faulty behavior in a run is completely independent
of the inputs to the processors.

¹Note that we are using a round-based model of asynchronous
systems; that is, we assume that processors have
access to a global clock. This is done purely for ease of
presentation. The full paper defines a more general model for
asynchronous systems.

This  paper  considers  synchronous  systems,  in 
which processors’ messages are always delivered
in  the  round  in  which  they  are  sent,  and  asyn¬ 
chronous  systems,  in  which  message-passing  is  re¬ 
liable,  but  there  is  no  bound  on  message  delivery 
times.  Within  the  scope  of  asynchronous  systems, 
we  consider  those  in  which  messages  between  two 
processors  are  always  delivered  in  the  order  sent 
(FIFO  communication)  as  well  as  those  in  which 
this  is  not  true. 

This work identifies a system with the set of all
runs of a communication protocol under a specified
set of assumptions about synchrony and failures. In
order to analyze systems, it is convenient to have
a logical language to make statements about the
system. A fact in this language is interpreted to be
a property of points: a fact $\varphi$ will be either true or
false at a given point $(r, l)$, denoted $(r, l) \models \varphi$ and
$(r, l) \not\models \varphi$, respectively. Fact $\varphi$ is valid in a system
if it is true at all points in the system. Although
facts are interpreted as properties of points, it is
often convenient to refer to facts that are about
objects other than points (e.g., properties of runs).
In general, a fact $\varphi$ is a fact about $X$ if fixing $X$
determines the truth (or falsity) of $\varphi$.

3  Coordination  Problems 

This  section  defines  two  classes  of  coordination 
problems.  In  general,  a  coordination  problem  is  a 
finite set of actions $C = \{a_1, \ldots, a_m\}$. Each action $a_i$
has associated with it an enabling condition $\mathit{ok}_i$,
which  is  a  fact  about  the  input  and  the  identities 
of  the  faulty  processors  (thus,  it  is  a  fact  about 
runs).  The  processors  must  coordinate  to  choose 
a  common  action  that  is  enabled.  Processors  need 
not  perform  their  actions  simultaneously. 

To solve a coordination problem, we augment a
communication protocol with a set of functions that
tell a processor when to perform a coordination action;
we call the result an action protocol. Action
protocols must be such that no processor performs
more than one action. Informally, an action protocol
is correct if, in all runs, processors agree
and choose an action that is enabled. We
consider  two  classes  of  such  protocols.  In  one  class, 
the  actions  taken  by  the  faulty  processors  are  not 
relevant;  in  the  other,  their  actions  are  subject  to 
the  same  correctness  criteria  as  those  of  the  non¬ 
faulty  processors.  We  call  these  classes  general  and 
consistent,  respectively.  Formally,  a  protocol  gen¬ 
erally  satisfies  C  (or  G-satisfies  C)  if  the  following 
conditions  hold  of  all  runs  of  the  protocol: 

1.  Validity.  If  an  action  is  performed  by  a  non¬ 
faulty  processor,  then  that  action  is  enabled. 

2.  Agreement.  If  a  nonfaulty  processor  performs 
an  action,  then  all  nonfaulty  processors  per¬ 
form  that  action. 

A  protocol  consistently  satisfies  C  (or  C-satisfies 
C)  if  the  above  conditions  hold  with  the  first  oc¬ 
currence  of  the  word  “nonfaulty”  omitted  from 
each.  Earlier  literature  on  coordination  concen¬ 
trated  on  general  coordination  [9,12];  more  recently, 
researchers  have  begun  to  consider  consistent  co¬ 
ordination  [13,15].  Note  that  these  coordination 
problems  can  be  solved  in  asynchronous  systems  in 
spite  of  the  impossibility  results  of  Fischer  et  al.  [4]. 
This is because they do not require nonfaulty processors
to perform an action in every run.
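
The Validity and Agreement conditions above can be checked mechanically on a summary of a single run. The following is a minimal sketch under stated assumptions, not part of the original paper: a run is summarized by a dictionary mapping each processor to the one action it performed (or `None`), and the names `satisfies`, `actions`, `nonfaulty`, and `enabled` are hypothetical. Because runs are infinite, such a summary is only meaningful once no further actions can occur.

```python
from typing import Dict, Optional, Set


def satisfies(actions: Dict[str, Optional[str]],
              nonfaulty: Set[str],
              enabled: Set[str],
              consistent: bool = False) -> bool:
    """Check Validity and Agreement on the summary of one run.

    actions[p] is the single action processor p performed, or None.
    consistent=False checks the general (G) conditions;
    consistent=True checks the consistent (C) conditions.
    """
    # Whose actions are constrained: every processor (C) or only nonfaulty ones (G).
    constrained = set(actions) if consistent else set(nonfaulty)
    performed = {actions[p] for p in constrained if actions[p] is not None}

    # Validity: every action performed by a constrained processor is enabled.
    if not performed <= enabled:
        return False

    # Agreement: any such action must be performed by all nonfaulty processors.
    return all(actions[q] == a for a in performed for q in nonfaulty)


# p3 is faulty and performs a different, non-enabled action.
run = {"p1": "commit", "p2": "commit", "p3": "abort"}
print(satisfies(run, {"p1", "p2"}, {"commit"}))                   # True
print(satisfies(run, {"p1", "p2"}, {"commit"}, consistent=True))  # False
```

The example run is acceptable for general coordination, where the faulty processor's action is ignored, but not for consistent coordination.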

Solutions to coordination problems are compared
by comparing their behavior in corresponding runs.
Two runs correspond if they have the same initial
input and the same pattern of faulty behavior. Protocol
$P_1$ dominates $P_2$ if, in any pair of corresponding
runs, no processor performs an action later in
the run of $P_1$ than it does in the run of $P_2$. A
protocol is X-optimum for C (where X is either G
or C) if it X-satisfies C and dominates every other
protocol that does so. Some problems do not have
optimum solutions. For example, Moses and Tuttle
[12] showed that Eventual Byzantine Agreement
does  not  have  an  optimum  solution.  Nevertheless, 
there  are  coordination  problems  for  which  optimum 
solutions  do  exist.  Neiger  and  Bazzi  [13]  gave  a  pre¬ 
cise  characterization  of  these  problems.  They  also 
considered  optimal  solutions  to  coordination  prob¬ 
lems.  Protocol  P  is  X-optimal  for  C  if  it  X-satisfies 
C  and  if  every  P'  that  X-satisfies  C  and  dominates 
P  is  in  turn  dominated  by  P. 
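
The dominance relation defined above can be illustrated on finite data. This is a sketch under assumptions that are not in the paper: each protocol is represented by a dictionary from a run key (standing for the initial input and the failure pattern, so that equal keys mean corresponding runs) to the time at which each processor acted, and a processor that never acts is treated as acting at time infinity. The names `dominates` and `ActionTimes` are hypothetical.

```python
import math
from typing import Dict, Hashable, Optional

# processor name -> time at which it performed its action (None = never)
ActionTimes = Dict[str, Optional[int]]


def _time(t: Optional[int]) -> float:
    """Treat 'never acts' as acting at time infinity (an assumption)."""
    return math.inf if t is None else t


def dominates(runs_p1: Dict[Hashable, ActionTimes],
              runs_p2: Dict[Hashable, ActionTimes]) -> bool:
    """True if, in every pair of corresponding runs, no processor
    acts later under the first protocol than under the second."""
    for key, times_p2 in runs_p2.items():
        times_p1 = runs_p1[key]
        for proc, t2 in times_p2.items():
            if _time(times_p1.get(proc)) > _time(t2):
                return False
    return True


key = ("input=1", "no failures")
fast = {key: {"p": 2, "q": 2}}
slow = {key: {"p": 3, "q": 2}}
print(dominates(fast, slow))  # True
print(dominates(slow, fast))  # False
```

In these terms, a protocol is optimum when `dominates` holds against every other correct protocol, and optimal when every correct protocol that dominates it is dominated by it in turn.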

4 Definitions of Knowledge

Processor knowledge was first defined by Halpern
and Moses [8] in the following way. Processor $p$
knows $\varphi$ at point $(r, l)$, denoted $(r, l) \models K_p \varphi$, if
$(r', l) \models \varphi$ for all runs $r'$ such that $r'_p(l) = r_p(l)$
(recall that the global clock is always part of a
processor’s local state). It is often useful to condition
a processor’s knowledge on its being nonfaulty. We
say that processor $p$ believes $\varphi$ if $p$ knows that, if it
is in $\mathcal{N}$, $\varphi$ is true. That is, $B_p \varphi \equiv K_p(p \in \mathcal{N} \Rightarrow \varphi)$.
It is easy to see that $(r, l) \models B_p \varphi$ if $(r', l) \models \varphi$ for
all runs $r'$ such that $r'_p(l) = r_p(l)$ and $p \in \mathcal{N}(r')$.
Processor knowledge, using the $K_p$ operators, will
be used to define strong notions of group knowledge,
while processor belief, using $B_p$, will be used
to define weaker notions.
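
The definitions of knowledge and belief can be evaluated directly when a system is a finite list of runs. The sketch below is illustrative only and rests on assumptions not in the paper: local states are opaque strings, a fact is a Python function of a run and a time, and the names `Run`, `knows`, and `believes` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass
class Run:
    local: Dict[str, Tuple[str, ...]]   # local[p][l] = state of p at time l
    nonfaulty: FrozenSet[str]


Fact = Callable[[Run, int], bool]


def knows(system: List[Run], p: str, fact: Fact, r: Run, l: int) -> bool:
    """K_p: the fact holds in every run p cannot distinguish from r at time l."""
    return all(fact(r2, l) for r2 in system
               if r2.local[p][l] == r.local[p][l])


def believes(system: List[Run], p: str, fact: Fact, r: Run, l: int) -> bool:
    """B_p: as K_p, but only runs in which p is nonfaulty are considered."""
    return all(fact(r2, l) for r2 in system
               if r2.local[p][l] == r.local[p][l] and p in r2.nonfaulty)


# p hears nothing from q in round 1. Either q is faulty (run a),
# or p itself failed to receive (run b).
a = Run({"p": ("init", "silence"), "q": ("init", "x")}, frozenset({"p"}))
b = Run({"p": ("init", "silence"), "q": ("init", "y")}, frozenset({"q"}))
system = [a, b]


def q_faulty(r: Run, l: int) -> bool:
    return "q" not in r.nonfaulty


print(knows(system, "p", q_faulty, a, 1))     # False
print(believes(system, "p", q_faulty, a, 1))  # True
```

In the example, $p$ cannot know that $q$ is faulty, because its own receive omission would look the same, but it believes it: in every indistinguishable run in which $p$ is nonfaulty, $q$ is faulty.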

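Because this paper deals with coordination
among the nonfaulty processors, we are specifically
interested in the knowledge possessed by the
set $\mathcal{N}$ of nonfaulty processors. Everyone in $\mathcal{N}$
knows $\varphi$, denoted $E_{\mathcal{N}} \varphi$, is defined as
$\bigwedge_{p \in \mathcal{N}} K_p \varphi$. All processors in $\mathcal{N}$ believe $\varphi$,
denoted $A_{\mathcal{N}} \varphi$, is $\bigwedge_{p \in \mathcal{N}} B_p \varphi$. Strong common
knowledge ($S_{\mathcal{N}} \varphi$) is equivalent to the infinite
conjunction $\bigwedge_{i \ge 1} E_{\mathcal{N}}^i \varphi$, while weak common
knowledge ($W_{\mathcal{N}} \varphi$) is equivalent to
$\bigwedge_{i \ge 1} A_{\mathcal{N}}^i \varphi$.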

Eventual common knowledge [8,16] relaxes the
simultaneity inherent in common knowledge. For
this reason, it is more appropriate in the study of
problems that do not require simultaneous coordination.
Informally, a fact is eventual common
knowledge to a set of processors if they all eventually
know it, all eventually know that all others
eventually know it, and so on ad infinitum. As we
showed elsewhere [13], eventual common knowledge
is necessary for achieving termination in solving a
coordination problem. The definition of eventual
knowledge uses the temporal operator eventually,
$\Diamond$. We say $(r, l) \models \Diamond\varphi$ if and only if $(r, l') \models \varphi$ for
some $l' \ge l$. Eventual common knowledge is also
defined using fixed points. Strong eventual common
knowledge of fact $\varphi$ by set $\mathcal{N}$, denoted
$S^{\Diamond}_{\mathcal{N}} \varphi$, is the greatest fixed point of the equivalence
$X \Leftrightarrow \Diamond E_{\mathcal{N}}(\varphi \wedge X)$.² Weak eventual common knowledge
of fact $\varphi$ by set $\mathcal{N}$, denoted $W^{\Diamond}_{\mathcal{N}} \varphi$, is the greatest
fixed point of the equivalence $X \Leftrightarrow \Diamond A_{\mathcal{N}}(\varphi \wedge X)$. In
general, $S^{\Diamond}_{\mathcal{N}} \varphi$ implies (but is not necessarily equivalent
to) the infinite conjunction $\bigwedge_{i \ge 1} (\Diamond E_{\mathcal{N}})^i \varphi$, and
a similar statement is true for weak eventual common
knowledge. One should note that eventual
common knowledge is weaker than simple common
knowledge. It does not require that processors gain
their knowledge simultaneously or that all levels
of knowledge will ever hold simultaneously. Eventual
common knowledge does not, in general, imply
“eventually” common knowledge.

²This is slightly stronger than the original definition of
Halpern and Moses [8]. They defined eventual common
knowledge to be the greatest fixed point of the equivalence
$X \Leftrightarrow \bigwedge_{p \in \mathcal{N}} \Diamond K_p(\varphi \wedge X)$. For the cases considered in this
paper, this definition is equivalent to that given here for strong
eventual common knowledge.

If  a  fact  is  eventual  common  knowledge  to  a  set, 
then  all  members  of  the  set  eventually  know  (or 
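believe) this. That is, both $S^{\Diamond}_{\mathcal{N}} \varphi \Rightarrow \Diamond E_{\mathcal{N}} S^{\Diamond}_{\mathcal{N}} \varphi$ and
$W^{\Diamond}_{\mathcal{N}} \varphi \Rightarrow \Diamond A_{\mathcal{N}} W^{\Diamond}_{\mathcal{N}} \varphi$ are valid. Each form of
eventual common knowledge satisfies an induction rule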
that  can  be  used  to  show  that  certain  facts  are  even¬ 
tual  common  knowledge: 

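• If $\varphi \Rightarrow \Diamond E_{\mathcal{N}}(\varphi \wedge \psi)$ is valid in a system, then
$\varphi \Rightarrow S^{\Diamond}_{\mathcal{N}} \psi$ is also valid in that system.

• If $\varphi \Rightarrow \Diamond A_{\mathcal{N}}(\varphi \wedge \psi)$ is valid in a system, then
$\varphi \Rightarrow W^{\Diamond}_{\mathcal{N}} \psi$ is also valid in that system.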

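Because the set $\mathcal{N}$ is assumed to be nonempty, it is
not hard to see that $S^{\Diamond}_{\mathcal{N}} \varphi \Rightarrow \varphi$ and $W^{\Diamond}_{\mathcal{N}} \varphi \Rightarrow \varphi$ are
valid when $\varphi$ is a fact about runs.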
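
The fixed-point definition of strong eventual common knowledge, $X \Leftrightarrow \Diamond E_{\mathcal{N}}(\varphi \wedge X)$, can be computed by downward iteration when the set of points is finite. The sketch below makes assumptions that are not in the paper: the system is a finite list of runs truncated at a finite time horizon (so "eventually" only looks up to that horizon), local states are opaque strings, and the names `Run` and `strong_eventual_ck` are hypothetical. Starting from the set of all points and repeatedly applying the operator yields the greatest fixed point, because the operator is monotone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Point = Tuple[int, int]   # (index of the run in the system, time)


@dataclass
class Run:
    value: int                          # stands in for the run's input
    local: Dict[str, Tuple[str, ...]]   # local[p][l] = state of p at time l
    nonfaulty: FrozenSet[str]


def strong_eventual_ck(system: List[Run], horizon: int,
                       fact: Callable[[Run, int], bool]) -> Set[Point]:
    """Points at which the fact is strong eventual common knowledge,
    within a finite horizon (an assumption of this sketch)."""
    points = {(i, l) for i in range(len(system)) for l in range(horizon + 1)}

    def knows(p: str, i: int, l: int, target: Set[Point]) -> bool:
        # p knows `target` at (i, l) if it holds at every point of the same
        # time whose run gives p the same local state.
        return all((j, l) in target for j, other in enumerate(system)
                   if other.local[p][l] == system[i].local[p][l])

    x = set(points)   # start from "true everywhere" and shrink
    while True:
        target = {(i, l) for (i, l) in x if fact(system[i], l)}   # fact AND X
        everyone = {(i, l) for (i, l) in points
                    if all(knows(p, i, l, target) for p in system[i].nonfaulty)}
        new_x = {(i, l) for (i, l) in points
                 if any((i, later) in everyone
                        for later in range(l, horizon + 1))}     # eventually
        if new_x == x:
            return x
        x = new_x


# One processor learns the input only at time 1.
system = [
    Run(1, {"p": ("start", "saw 1")}, frozenset({"p"})),
    Run(0, {"p": ("start", "saw 0")}, frozenset({"p"})),
]
result = strong_eventual_ck(system, 1, lambda r, l: r.value == 1)
print(sorted(result))   # [(0, 0), (0, 1)]
```

In the example, the fact "the input is 1" is already strong eventual common knowledge at time 0 of the first run, although the processor only comes to know it at time 1. This illustrates how the eventual variant relaxes simultaneity.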

We  now  present  two  theorems  regarding  the  re¬ 
lationship  between  coordination  and  eventual  com¬ 
mon  knowledge  that  we  proved  in  an  earlier  pa¬ 
per  [13].  The  first  theorem  shows  that  some  form  of 
eventual  common  knowledge  is  necessary  to  achieve 
coordination. 

The second theorem shows that any optimum coordination
protocol must perform an action as soon
as some enabling condition becomes eventual common
knowledge.

Theorem 1: Let C be a coordination problem. If
a protocol G-satisfies C, then, whenever processor
$p$ performs action $a_i$, $B_p W^{\Diamond}_{\mathcal{N}} \mathit{ok}_i$ holds. That is, $p$ must
know that, if it is nonfaulty, $\mathit{ok}_i$ is weak eventual
common knowledge. Similarly, if a protocol
C-satisfies C, then, whenever processor $p$ performs
action $a_i$, $K_p S^{\Diamond}_{\mathcal{N}} \mathit{ok}_i$ holds.

Theorem 2: Let C be a coordination problem. If a
protocol is G-optimum for C, then processor $p$ must
perform some action as soon as $B_p W^{\Diamond}_{\mathcal{N}} \mathit{ok}_i$ holds for
any action $a_i$. Similarly, if a protocol is C-optimum
for C, then processor $p$ must perform some action
as soon as $K_p S^{\Diamond}_{\mathcal{N}} \mathit{ok}_i$ holds for any action $a_i$.

Together, these results show that a thorough understanding
of eventual common knowledge will
give a better understanding of the possibility of
achieving coordination and of the complexity of
optimum protocols. Section 5 gives situations in
which consistent coordination is impossible. Section
6 considers the complexity of optimum coordination
protocols.

5  Impossibility  Results  for  Consistent 
Coordination 

This section shows that, if $n \le 2t$, then consistent
coordination can be achieved neither in synchronous
systems with general omission failures
nor in asynchronous systems with send- or general
omission failures. This is done by showing that it
is impossible to achieve strong eventual common
knowledge in those systems. This distinction between
general and consistent coordination occurs
because of the uncertainty regarding the identity
of the correct processors. There can be two disjoint
sets of at most $t$ processors such that processors
within a set communicate with no trouble,
but processors in different sets never communicate.
Since this behavior is consistent with either set of
processors being the set of faulty processors, no
processor can know whether or not it is faulty. In the
context of general coordination, this is not a problem,
since the actions of the faulty processors are
unimportant, so each set can simply behave as if it
is the set of nonfaulty processors. In the context
of consistent coordination, however, this behavior
is critical: because the processors may be isolated
from important information and not know whether
or not they are faulty, the correct processors can
become “paralyzed” and unable to act.

This technique was first used by Neiger and
Toueg [14] to show the impossibility of certain
translations between systems with failures. Neiger
and Tuttle [15] later used it to show the impossibility
of simultaneous coordination in systems with
general omission failures and $n \le 2t$.

We  handle  the  situations  of  synchronous  and 
asynchronous  systems  differently.  This  is  because 
the  results  for  asynchronous  systems  apply  to  send- 
omission  failures  as  well  as  general  omission  fail¬ 
ures.  Intuitively,  such  a  system  may  be  “parti¬ 
tioned”  by  a  combination  of  “delayed”  messages 
and  send-omission  failures. 

5.1 Synchronous Systems with General
Omission Failures

This section considers synchronous systems with
general omission failures and $n \le 2t$. We say that
two sets $A$ and $B$ partition the set of processors if
$A$ and $B$ are nonempty and disjoint, $A \cup B = P$,
and $|A|, |B| \le t$. Given a run $r$ such that there
is complete (failure-free) communication within $A$
and within $B$, let $r^k$ be the run identical to $r$ (the
failure pattern is the same) except that, for every
round $l \ge k$, no processor in $B$ sends or receives a
message to or from any processor in $A$ in round $l$
of $r^k$; note that $A$ is the set of nonfaulty processors
in $r^k$. We say that $r^k$ is the result of partitioning $r$
into $A$ and $B$ from time $k$. Informally, the following
lemma states that, if a fact becomes strong eventual
common knowledge in $r$, then it also becomes
eventual common knowledge in $r^k$ for all $k$.

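Lemma 3: Let $\varphi$ be a fact about runs and let $A$
and $B$ be two sets that partition the set of processors.
If $(r, l) \models S^{\Diamond}_{\mathcal{N}} \varphi$, then, for all
$k$, there is a $j \ge k$ such that, if $r^k$ is the result
of partitioning $r$ into $A$ and $B$ from time $k$,
$(r^k, j) \models \bigvee_{p \in A} K_p S^{\Diamond}_{\mathcal{N}} \varphi$.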

Proof: The proof is by reverse induction on $k$.

For $k \ge l$, the result easily holds for any $j \ge k$.
This is because $r^k$ and $r$ cannot be distinguished
at time $l$, so $(r^k, l) \models S^{\Diamond}_{\mathcal{N}} \varphi$. Since $\varphi$ is a fact
about runs, $S^{\Diamond}_{\mathcal{N}} \varphi$ is stable and knowledge of it cannot
be lost. Now assume that the result holds for
$k + 1$. That is, $(r^{k+1}, j) \models \bigvee_{p \in A} K_p S^{\Diamond}_{\mathcal{N}} \varphi$ for some
$j \ge k + 1$. Let $r'$ be a run identical to $r^{k+1}$ except
that $\mathcal{N}(r') = B$ and all processors in $A$ fail to send
to those in $B$ in round $k$. It should be clear that $r'$
is a run of the system: there are no failures within
$A$ or within $B$, and both have size at most
$t$. Because the system admits general omission failures,
simply change send omissions to receive omissions
and vice versa. Since $r'_p(j) = r^{k+1}_p(j)$ for any
$p \in \mathcal{N}(r^{k+1})$, $(r', j) \models S^{\Diamond}_{\mathcal{N}} \varphi$. Since eventual common
knowledge eventually becomes known to every
nonfaulty processor, there is some $j' \ge j$ such that
$(r', j') \models \bigvee_{p \in B} K_p S^{\Diamond}_{\mathcal{N}} \varphi$. Now let $r^k$ be a run identical
to $r'$ except that $\mathcal{N}(r^k) = A$ and all processors
in $B$ fail to send to those in $A$ in round $k$. Note
that $r^k$ is the result of partitioning $r$ into $A$ and $B$
from time $k$. Furthermore, $(r^k, j') \models S^{\Diamond}_{\mathcal{N}} \varphi$, using
the argument given above. This means that there
is some $j'' \ge j'$ such that $(r^k, j'') \models \bigvee_{p \in A} K_p S^{\Diamond}_{\mathcal{N}} \varphi$,
completing the proof. □
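
The partitioning construction used in this subsection can be made concrete on a run's message pattern. The following is a sketch under assumptions that are not in the paper: a run is represented only by the set of messages delivered in it, each a `(round, sender, receiver)` triple; the cut applies to rounds `k` and later; and the names `is_partition` and `partition_run` are hypothetical.

```python
from typing import Set, Tuple

Message = Tuple[int, str, str]   # (round, sender, receiver)


def is_partition(a: Set[str], b: Set[str],
                 processors: Set[str], t: int) -> bool:
    """A and B are nonempty, disjoint, cover all processors,
    and each has at most t members."""
    return (bool(a) and bool(b) and not (a & b)
            and (a | b) == processors
            and len(a) <= t and len(b) <= t)


def partition_run(delivered: Set[Message], a: Set[str], b: Set[str],
                  k: int) -> Set[Message]:
    """Drop every message between A and B from round k onward;
    communication inside A and inside B is left untouched."""
    def crosses(sender: str, receiver: str) -> bool:
        return ((sender in a and receiver in b)
                or (sender in b and receiver in a))

    return {(rnd, s, d) for (rnd, s, d) in delivered
            if rnd < k or not crosses(s, d)}


procs = {"p1", "p2", "p3", "p4"}
A, B = {"p1", "p2"}, {"p3", "p4"}
# A failure-free run: everyone hears from everyone else in rounds 1..3.
full = {(rnd, s, d) for rnd in (1, 2, 3)
        for s in procs for d in procs if s != d}
cut = partition_run(full, A, B, k=2)
print(is_partition(A, B, procs, t=2))   # True
print((1, "p1", "p3") in cut)           # True
print((2, "p1", "p3") in cut)           # False
```

With four processors and `t = 2`, either half could be the faulty set, which is the source of the uncertainty the impossibility argument exploits. With `t = 1` the same sets are no longer a partition in this sense.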

Given that we can relate any run with strong
eventual  common  knowledge  to  one  in  which  the 
system  is  partitioned  from  time  0  (and  in  which  the 
same  knowledge  holds),  we  can  show  the  following: 

Theorem 4: If $\varphi$ is a fact about the input and
faulty processors that is not valid, then, for every
failure-free run $r$ and time $l$, $(r, l) \models \neg S^{\Diamond}_{\mathcal{N}} \varphi$.

Proof: The proof is by repeated application of
Lemma 3, where the input (and set of faulty processors)
is manipulated in the partitioned runs to
relate the original run to one in which $\varphi$ (and
thus $S^{\Diamond}_{\mathcal{N}} \varphi$) is not true. This contradicts the fact
that $S^{\Diamond}_{\mathcal{N}} \varphi$ was true in the original run. Note that
failure-free runs are candidates for the application
of Lemma 3. □

A  coordination  problem  is  nontrivial  if  none  of 
its  enabling  conditions  is  valid.  It  is  easy  to  use 
Theorem  4  to  show  the  following: 

Theorem  5:  In  synchronous  systems  with  general 
omission failures and $n \le 2t$, if a protocol C-solves
a  nontrivial  coordination  problem,  then  no  action 
is  performed  in  failure-free  runs. 

5.2  Asynchronous  Systems  with  Send-  or 
General  Omission  Failures 

This section shows a result analogous to that shown
in the previous section. This result holds for asynchronous
systems and includes send- as well as general
omission failures as long as $n \le 2t$. Because
of this, we need to reconsider the definition of a
partitioning, which depended on the possibility of
general omission failures. Now, $r^k$ is a result of
partitioning $r$ into $A$ and $B$ from time $k$ if $r^k$ is
identical to $r$ except that, for every round $l \ge k$,
no processor in $B$ sends to any processor in $A$ in
round $l$ of $r^k$. Note that, because messages need
not be delivered in the round in which they are
sent, there may be partitions of a run $r$ that differ
with respect to when messages are delivered.

The  following  analogue  to  Lemma  3  holds  in  this 
case: 

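Lemma 6: Let $\varphi$ be a fact about runs and let $A$
and $B$ be two sets that partition the set of processors.
If $(r, l) \models S^{\Diamond}_{\mathcal{N}} \varphi$, then, for all $k$,
there is a $j \ge k$ and an $r^k$ that is the result of
partitioning $r$ into $A$ and $B$ from time $k$ such that
$(r^k, j) \models \bigvee_{p \in A} K_p S^{\Diamond}_{\mathcal{N}} \varphi$.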

Proof: The proof is identical to that of Lemma 3
with the following exception. When run $r'$ is constructed,
we cannot simply switch send and receive
omissions, as there are only send omissions. Instead,
let $r'$ be such that messages omitted from $B$
to $A$ arrive after time $j$. Again, $r'_p(j) = r^{k+1}_p(j)$ for
any $p \in \mathcal{N}(r^{k+1})$. The same method can be used
when constructing $r^k$. □


Lemma  6  can  now  be  used  to  prove  an  analogue 
to  Theorem  4,  giving  in  the  end  the  following  result: 

Theorem 7: In asynchronous systems with send-
or general omission failures and $n \le 2t$, if a protocol
C-solves a nontrivial coordination problem, then
no action is performed in failure-free runs.

6  Complexity  Results  for  Optimum 
Coordination 

This section considers systems in which coordination
problems  can  be  solved.  Thus,  it  considers  general 
coordination  as  well  as  consistent  coordination  in 
systems  other  than  those  discussed  in  Section  5.  It 
begins  by  relating  the  two  forms  of  eventual  com¬ 
mon  knowledge  to  another  form  of  knowledge  that 
is  easier  to  reason  about:  distributed  knowledge.  It 
then  considers  the  complexity  of  testing  for  dis¬ 
tributed  knowledge.  In  some  cases,  we  can  show 
that  the  necessary  tests  can  be  performed  in  poly¬ 
nomial  time.  In  others,  we  show  that  these  tests 
are  NP-hard.  Because  of  the  proven  relationship 
between  distributed  and  eventual  common  knowl¬ 
edge,  these  results  also  hold  for  the  implementation 
of  optimum  coordination  protocols. 

The results given in this section apply primarily
to coordination problems whose enabling conditions
are facts about the input. Note that, if $\varphi$ is a
fact about the input, then $B_p \varphi \Leftrightarrow K_p \varphi$. To study
the complexity of coordination, one must define the
parameters that determine the size of a problem.
We consider these to be the number of processors
in the system and the number of rounds that have
taken place.

The results below apply to full information protocols.
These are protocols in which processors
communicate all the information available to them
every round. Moses and Tuttle [12] showed that
full information protocols can be implemented with
polynomial size messages in synchronous systems
with the types of failure we are considering. This
generalizes to asynchronous systems as follows. After
round $r$, a labeled communication graph is a
graph with $(r + 1)n$ vertices arranged in $r + 1$
columns of $n$ vertices each. Each row corresponds
to one of the processors. An edge between vertices
$(p_1, r_1)$ and $(p_2, r_2)$ means that the message sent
by $p_1$ at the beginning of round $r_1$ is received by $p_2$
at the end of round $r_2 - 1$. Each processor maintains a
local communication graph that includes its view
of the execution and sends this graph to all processors
in each round. When one processor receives
the communication graph from another, it updates
its communication graph by forming the union of
the two graphs and adding an edge corresponding
to the message received.
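
The update rule for communication graphs described above amounts to a set union plus one new edge. The following is a minimal sketch under assumptions not in the paper: a graph is stored as a set of edges between `(processor, column)` vertices, vertex labels are omitted, and the function name `on_receive` and its parameters are hypothetical.

```python
from typing import Set, Tuple

Vertex = Tuple[str, int]       # (processor, column index)
Edge = Tuple[Vertex, Vertex]   # (send vertex, receive vertex)


def on_receive(own: Set[Edge], received: Set[Edge],
               sender: str, send_column: int,
               receiver: str, recv_column: int) -> Set[Edge]:
    """Merge the sender's graph into the receiver's graph and record
    the delivery of this very message as a new edge."""
    new_edge: Edge = ((sender, send_column), (receiver, recv_column))
    return own | received | {new_edge}


# q already knows that a message from s (column 1) reached it (column 2).
graph_q = {(("s", 1), ("q", 2))}
graph_p: Set[Edge] = set()
# p now receives q's graph, sent in column 2 and received in column 3.
graph_p = on_receive(graph_p, graph_q, "q", 2, "p", 3)
print(sorted(graph_p))
```

Because the update is a union, the receiver learns everything the sender had recorded, which is what makes the protocol a full information protocol. Since a graph after round $r$ has $(r + 1)n$ vertices, it has at most quadratically many edges in that number.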

In  the  case  of  general  coordination  (Section  6.2), 
we  make  the  standard  assumption  that  the  truth 
or falsity of a fact can be calculated in polynomial
time  from  the  labeled  communication  graph. 

6.1 Eventual Common Knowledge and
Distributed  Knowledge 

We begin by giving a definition of distributed
knowledge [2,8]. This is simply a group knowledge
analogue to the single processor knowledge.
Fact $\varphi$ is distributed knowledge to the nonfaulty
processors at point $(r, l)$, denoted $(r, l) \models D_{\mathcal{N}} \varphi$,
if $(r', l) \models \varphi$ for all runs $r'$ such that $r'_p(l) = r_p(l)$
for all $p \in \mathcal{N}(r)$. Thus, the combined states of
the processors nonfaulty at $(r, l)$ determine that $\varphi$
holds. Because it may be possible that no processor
in $\mathcal{N}(r)$ explicitly knows $\varphi$, we say that the
knowledge is distributed among the members of the
group. Note first the following fact relating knowledge,
belief, and distributed knowledge:

Lemma 8: If $\varphi$ is a fact about runs, then $K_p \varphi \Rightarrow
B_p D_{\mathcal{N}} \varphi$ is valid.
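
The definition of distributed knowledge given above can be evaluated on a finite system in the same way as individual knowledge. The sketch below is illustrative and rests on assumptions not in the paper: the system is a finite list of runs, local states are opaque strings, and the names `Run`, `knows`, and `distributed_knowledge` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass
class Run:
    inputs: Tuple[int, int]             # inputs of p and q
    local: Dict[str, Tuple[str, ...]]   # local[p][l] = state of p at time l
    nonfaulty: FrozenSet[str]


Fact = Callable[[Run, int], bool]


def knows(system: List[Run], p: str, fact: Fact, r: Run, l: int) -> bool:
    """K_p: the fact holds in every run that p cannot distinguish from r."""
    return all(fact(r2, l) for r2 in system
               if r2.local[p][l] == r.local[p][l])


def distributed_knowledge(system: List[Run], fact: Fact,
                          r: Run, l: int) -> bool:
    """D_N: the fact holds in every run that agrees with r on the local
    states of all processors that are nonfaulty in r."""
    return all(fact(r2, l) for r2 in system
               if all(r2.local[p][l] == r.local[p][l] for p in r.nonfaulty))


# Each processor sees only its own input bit; both are nonfaulty.
system = [Run((a, b), {"p": (f"x={a}",), "q": (f"x={b}",)},
              frozenset({"p", "q"}))
          for a in (0, 1) for b in (0, 1)]
r11 = system[3]   # the run in which both inputs are 1


def both_one(r: Run, l: int) -> bool:
    return r.inputs == (1, 1)


print(distributed_knowledge(system, both_one, r11, 0))   # True
print(knows(system, "p", both_one, r11, 0))              # False
```

In the example, neither processor alone knows that both inputs are 1, but their combined local states determine it, which is the situation the phrase "distributed among the members of the group" describes.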

It  turns  out  that  distributed  knowledge  is  closely 
related to eventual common knowledge. We begin
by  noting  that  it  implies  weak  eventual  common 
knowledge  because  we  can  prove  the  following: 

Lemma 9: For any fact $\varphi$ about runs, $D_{\mathcal{N}} \varphi \Rightarrow
W^{\Diamond}_{\mathcal{N}} \varphi$ is valid in any system running a
full-information protocol.

Proof: We will prove this using the induction
rule for weak eventual common knowledge. Specifically,
we will prove that $D_{\mathcal{N}} \varphi \Rightarrow \Diamond A_{\mathcal{N}}(D_{\mathcal{N}} \varphi \wedge \varphi)$
is valid in the system; by induction, the desired
implication will then be valid. Assume that $(r, l) \models
D_{\mathcal{N}} \varphi$. It suffices to show that, for any processor
$p \in \mathcal{N}$, there is some time $l' \ge l$ such that
$(r, l') \models B_p(D_{\mathcal{N}} \varphi \wedge \varphi)$. Let $l'$ be the earliest time
by which $p$ has received a message from each processor
in $\mathcal{N}$ that was sent at time $l$ or later. Since
message passing is reliable between correct processors,
$l'$ is guaranteed to exist. Because $(r, l) \models D_{\mathcal{N}} \varphi$
and $\varphi$ is a fact about runs, $(r, l') \models K_p \varphi$: by time $l'$,
$p$ has all the information available to the nonfaulty
processors at time $l$. It should be clear that $K_p \varphi \Rightarrow
B_p D_{\mathcal{N}} \varphi$ in any system, so $(r, l') \models B_p(D_{\mathcal{N}} \varphi \wedge \varphi)$,
as desired. □


We  next  prove  that  belief  or  knowledge  of  even¬ 
tual  common  knowledge  implies  belief  or  knowledge 
of  distributed  knowledge. 

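Lemma 10: Let $\varphi$ be a fact about the input. Then
the following are valid in any system: $B_p W^{\Diamond}_{\mathcal{N}} \varphi \Rightarrow
B_p D_{\mathcal{N}} \varphi$ and $K_p S^{\Diamond}_{\mathcal{N}} \varphi \Rightarrow K_p D_{\mathcal{N}} \varphi$.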

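Proof: The proof is given only for the first case.
Let $(r, l) \models B_p W^{\Diamond}_{\mathcal{N}} \varphi$. Since $W^{\Diamond}_{\mathcal{N}} \varphi \Rightarrow \varphi$ is valid
for facts about runs, $(r, l) \models B_p \varphi$. Since $\varphi$ is
a fact about the input, $B_p \varphi \Leftrightarrow K_p \varphi$ is valid, so
$(r, l) \models K_p \varphi$. By Lemma 8, $(r, l) \models B_p D_{\mathcal{N}} \varphi$, as
desired. □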

The  converse  of  the  first  implication  follows  from 
Lemma  9.  The  converse  of  the  second  is  valid  for 
all systems that we consider. Specifically, it holds
in  all  cases  except  for  systems  in  which  consistent 
coordination  is  impossible  and  we  can  thus  prove 
Lemmas  11  and  12: 

Lemma 11: Consider the execution of a full information
protocol in a synchronous system with crash
or send-omission failures or with general omission
failures and $n > 2t$. If $\varphi$ is a fact about runs, then
$D_{\mathcal{N}} \varphi \Rightarrow S^{\Diamond}_{\mathcal{N}} \varphi$ is valid.

Proof: By the induction rule for strong eventual
common knowledge, it suffices to show that $D_{\mathcal{N}} \varphi \Rightarrow
\Diamond E_{\mathcal{N}}(D_{\mathcal{N}} \varphi \wedge \varphi)$. Assume that $(r, l) \models D_{\mathcal{N}} \varphi$.
Consider now two cases:

• There are only crash or send-omission failures.
Because all nonfaulty processors send
correctly in round $l + 1$, all noncrashed processors
will know $\varphi$ by the end of that round. In
round $l + 2$, every nonfaulty processor will, for
every other processor, receive a message from
that processor or know that it is faulty. That
is, $(r, l + 2) \models E_{\mathcal{N}} \bigwedge_{p \in P}(p \notin \mathcal{N} \vee K_p \varphi)$, which
means $(r, l + 2) \models E_{\mathcal{N}}(D_{\mathcal{N}} \varphi \wedge \varphi)$, giving the
desired result.

• There are general omission failures and $n > 2t$.
Because all nonfaulty processors communicate
correctly in round $l + 1$, $(r, l + 1) \models E_{\mathcal{N}} \varphi$. In
round $l + 2$, each nonfaulty processor receives
messages from at least $n - t > t$ processors
that know $\varphi$. Since it knows that at least one
of these must be correct, $(r, l + 2) \models E_{\mathcal{N}}(D_{\mathcal{N}} \varphi \wedge
\varphi)$, as desired.

In either case, $D_{\mathcal{N}} \varphi \Rightarrow \Diamond E_{\mathcal{N}}(D_{\mathcal{N}} \varphi \wedge \varphi)$ so, by
induction, $D_{\mathcal{N}} \varphi \Rightarrow S^{\Diamond}_{\mathcal{N}} \varphi$ is valid. □

Lemma 12: Consider the execution of a full information
protocol in an asynchronous system with
crash failures or with send- or general omission failures
and $n > 2t$. If $\varphi$ is a fact about runs, then
$D_{\mathcal{N}} \varphi \Rightarrow S^{\Diamond}_{\mathcal{N}} \varphi$ is valid.

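Proof: The proof begins by showing that
$\Diamond D_{\mathcal{N}} \varphi \Rightarrow S^{\Diamond}_{\mathcal{N}} \varphi$ is valid. We will do this by induction,
showing first that $\Diamond D_{\mathcal{N}} \varphi \Rightarrow \Diamond E_{\mathcal{N}}(\Diamond D_{\mathcal{N}} \varphi \wedge \varphi)$.
Let $(r, l) \models \Diamond D_{\mathcal{N}} \varphi$. Let $k \ge l$ be such that
$(r, k) \models D_{\mathcal{N}} \varphi$ and let $j \ge k$ be such that every nonfaulty
processor receives a message from every other
that was sent after time $k$. Clearly, $(r, j) \models E_{\mathcal{N}} \varphi$.
Now consider two cases:

• There are only crash failures. Consider any
$p \in \mathcal{N}(r)$. At time $j + 1$, $p$ knows that it correctly
sent messages to all nonfaulty processors
after it knew $\varphi$. Since these must be eventually
received, $(r, j + 1) \models K_p \Diamond D_{\mathcal{N}} \varphi$. This means
that $(r, j) \models \Diamond E_{\mathcal{N}}(\Diamond D_{\mathcal{N}} \varphi \wedge \varphi)$.

• There are send- or general omission failures
and $n > 2t$. Let $j' \ge j$ be such that every
nonfaulty processor receives a message from
every other that was sent after time $j$. Since
$n > 2t$, each nonfaulty processor receives by
time $j'$ at least $n - t > t$ such messages and
knows that at least one is from a nonfaulty processor.
Thus, $(r, j') \models E_{\mathcal{N}} \bigvee_{p \in \mathcal{N}} K_p \varphi$, which
means that $(r, j') \models E_{\mathcal{N}} \Diamond D_{\mathcal{N}} \varphi$. Thus, $(r, l) \models
\Diamond E_{\mathcal{N}}(\Diamond D_{\mathcal{N}} \varphi \wedge \varphi)$, as desired.

In either case, $\Diamond D_{\mathcal{N}} \varphi \Rightarrow \Diamond E_{\mathcal{N}}(\Diamond D_{\mathcal{N}} \varphi \wedge \varphi)$ is valid
so, by induction, $\Diamond D_{\mathcal{N}} \varphi \Rightarrow S^{\Diamond}_{\mathcal{N}} \varphi$ is valid. Since
$D_{\mathcal{N}} \varphi \Rightarrow \Diamond D_{\mathcal{N}} \varphi$ is valid, $D_{\mathcal{N}} \varphi \Rightarrow S^{\Diamond}_{\mathcal{N}} \varphi$ is valid as
well. □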

Lemmas  11  and  12  lead  to  the  equivalences  noted 
in  the  following  theorem: 

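Theorem 13: Let $\varphi$ be a fact about the input.
Then $B_p W^{\Diamond}_{\mathcal{N}} \varphi \Leftrightarrow B_p D_{\mathcal{N}} \varphi$ is valid. In any system
in which consistent coordination is possible,
$K_p S^{\Diamond}_{\mathcal{N}} \varphi \Leftrightarrow K_p D_{\mathcal{N}} \varphi$ is valid.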

Proof: A proof is given only for the second conclusion.
Since [...] $\Rightarrow$ [...] is valid, Lemma 10
implies that $K_p S^{\Diamond}\varphi \Rightarrow K_p D_{\mathcal N}\varphi$ is also valid. By
Lemmas 11 and 12, $D_{\mathcal N}\varphi \Rightarrow S^{\Diamond}\varphi$ is valid for all
systems in which consistent coordination is possible.
This completes the proof.  □

These equivalences are important because they
show that the belief or knowledge needed for coordination
can be exactly expressed in terms of
distributed knowledge. Because distributed knowledge
is easy to reason about, these equivalences allow
us to analyze the complexity of optimum coordination
protocols.

6.2  The  Complexity  of  Optimum  General 
Coordination 

This section and Section 6.3 both restrict their consideration
to coordination problems for which the enabling
conditions are facts about the input. Theorem 13
showed that, for such facts,
$B_p W^{\Diamond}\varphi \Leftrightarrow B_p D_{\mathcal N}\varphi$.
Theorem 2 showed that, in any optimum
protocol for general coordination, processor $p$ performs
an action as soon as $B_p W^{\Diamond}\varphi$. Thus, we can
address the complexity of optimum general coordination
by seeing how hard it is for $p$ to determine
$B_p D_{\mathcal N}\varphi$. It turns out that this can be done
relatively easily.

Because $\varphi$ is a fact about the input, $B_p\varphi \Leftrightarrow K_p\varphi$.
As noted earlier, $K_p\varphi \Rightarrow B_p D_{\mathcal N}\varphi$; thus,
$B_p\varphi \Rightarrow B_p D_{\mathcal N}\varphi$. It is obvious that
$B_p D_{\mathcal N}\varphi \Rightarrow B_p\varphi$. Thus,
$B_p D_{\mathcal N}\varphi \Leftrightarrow B_p\varphi$, meaning that a processor must
act as soon as it believes $\varphi$. This is very easy for
facts about the input. A processor simply examines
its local state and determines whether or not the
inputs of which it is aware support $\varphi$. This can be
done in polynomial time.

6.3  The Complexity of Optimum
Consistent Coordination

Understanding the complexity of consistent coordination
is not as easy as doing so for general coordination.
In these cases, processors need to test for
knowledge of, and not belief in, distributed knowledge
of the enabling conditions. It turns out that,
in some cases, this can be checked in polynomial
time, while in others checking is NP-hard. We begin
by considering cases in which doing so is easy.

6.3.1  When  Optimum  Consistent 
Coordination  is  Easy 

This section shows that, in systems in which failures
are easily detected or in which they cannot be
detected at all, testing for knowledge of distributed
knowledge can be done efficiently. As will be seen in
the next section, these tests are NP-hard in systems
with general omission failures, in which the identity
of faulty processors is hard to verify. The remainder
of this section considers coordination problems
in which the enabling conditions are facts about the
input that can be expressed in a certain form. We
say that a fact about the input is basic if it is of the
form “p’s initial input is (not) v.” A formula $\varphi$ is in
conjunctive normal form (CNF) if it is of the form
$\bigwedge_i \bigvee_j \psi_{i,j}$ (where $i$ and $j$ have finite range). While
any fact about the input can be expressed in CNF, we
restrict our consideration to those that can be so
expressed in length polynomial in the number of
processors in the system.*

As an example, consider a synchronous system
with send-omission failures. Processor $p$ knows processor
$q$ is faulty if and only if $p$’s local state includes
indication of a missing message that $q$ should
have sent. This can be checked easily by inspecting
$p$’s communication graph. Suppose that $p$ knows
of $f \le t$ such failures. Recall that $\varphi$ is in CNF;
that is, $\varphi = \bigwedge_i \bigvee_j \psi_{i,j}$, where each $\psi_{i,j}$ is a basic
fact about the input. Whether or not a processor
knows a basic fact can be easily checked by inspecting
the communication graph. It is not hard to see
that $K_p D_{\mathcal N}\varphi$ if and only if
$\bigwedge_i K_p D_{\mathcal N} \bigvee_j \psi_{i,j}$; thus,
we can test for each $K_p D_{\mathcal N} \bigvee_j \psi_{i,j}$ separately. For
each $i$, test as follows. Check to see if $\bigvee_j \psi_{i,j}$ is
valid; this requires only checking to see if the disjunction
contains a basic fact and its negation. If
the disjunction is valid, then $K_p D_{\mathcal N} \bigvee_j \psi_{i,j}$ trivially
holds. If not, inspect the communication graph and
count the number of processors (that are not known
to be faulty) that know one of the basic facts that
comprise the disjunction. If this number is greater
than $t - f$, then at least one of these is correct and
$K_p D_{\mathcal N} \bigvee_j \psi_{i,j}$ holds; otherwise, all these processors
could be faulty and $\neg K_p D_{\mathcal N} \bigvee_j \psi_{i,j}$.
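
The counting test described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the representation of a basic fact as a `(processor, value, positive)` triple, the `knowers` map (standing in for what $p$ reads off its communication graph), and the function names are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: (processor, value, positive) stands for
# "processor's initial input is value" (positive=True) or its negation.
BasicFact = Tuple[str, int, bool]
Clause = List[BasicFact]  # one disjunction of basic facts


def clause_is_valid(clause: Clause) -> bool:
    """Treat a disjunction as valid if it holds a basic fact and its negation."""
    facts = set(clause)
    return any((proc, val, not pos) in facts for (proc, val, pos) in facts)


def knows_distributed_knowledge(
    cnf: List[Clause],
    knowers: Dict[BasicFact, Set[str]],  # assumed: read off p's communication graph
    known_faulty: Set[str],              # processors p knows to be faulty
    t: int,                              # bound on the number of failures
) -> bool:
    """Clause-by-clause test: every clause must be valid or have
    more than t - f knowers that are not known to be faulty."""
    f = len(known_faulty)
    for clause in cnf:
        if clause_is_valid(clause):
            continue
        candidates: Set[str] = set()
        for fact in clause:
            candidates |= knowers.get(fact, set()) - known_faulty
        if len(candidates) <= t - f:
            return False  # all of these processors could be faulty
    return True
```

Each clause is inspected once and each basic fact is looked up once, so the running time is polynomial in the size of the CNF formula and the number of processors.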

In asynchronous systems without FIFO communication,
it is impossible to ever detect that a processor
is faulty. Any omission failure, either to send
or to receive, cannot be distinguished from a message
that is in transit. If communication is FIFO,
then some send-omission failures can be detected
(if a message arrives that is subsequent to one that
was omitted). In this case, checking for knowledge
of distributed knowledge is again easy using the
method outlined above. General omission failures
cannot easily be detected (in either synchronous or
asynchronous systems), and the problem becomes
NP-hard; see the next section.

These  observations  lead  to  the  following  results: 

Lemma 14: Consider any system with crash or
send-omission failures or any asynchronous system
with non-FIFO communication. Let $\varphi$ be a fact
about the input that can be expressed in CNF in
polynomial size. Then $K_p D_{\mathcal N}\varphi$ can be checked in
polynomial time.

*The restriction on polynomial length makes sense only
under the assumption that any coordination problem is
parameterized by n and t.


Theorem 15: Consider any system with crash or
send-omission failures or any asynchronous system
with non-FIFO communication. Let C be a coordination
problem that has a C-optimum solution and
whose enabling conditions are facts about the input
that can be expressed in CNF in polynomial
size. Then the C-optimum protocol can be implemented
in polynomial time.

6.3.2  When  Consistent  Coordination  is 
Hard 

This section considers the case of synchronous systems
with general omission failures and $n > 2t$. In
this case, testing for $K_p D_{\mathcal N}\varphi$ is NP-hard even when
$\varphi$ is a basic fact about the input. This can be used
to show that implementing optimum coordination
protocols requires NP-hard computation.

This result is similar to one of Moses and Tuttle
[12], who showed that optimum protocols for
general simultaneous coordination required NP-hard
computation; Neiger and Tuttle [15] extended
this result to consistent simultaneous coordination.
However, note that, in the cases considered here
(nonsimultaneous coordination), the NP-hardness
results apply only to consistent coordination. As
noted in Section 6.2, optimum protocols for many
general coordination problems can be implemented
in polynomial time.

The complexity result is achieved by giving a
polynomial-time Turing reduction from clique to
testing for $K_p D_{\mathcal N}\varphi$. The clique problem is the following:
given a graph $G = (V,E)$ and constant $k$
($1 \le k \le |V|$), is there a clique in $G$ of size $k$? This
problem is NP-complete [5]. The idea behind the
proof is that the communication structure of the set
of nonfaulty processors in a system forms a clique.
The following states the result for synchronous systems.

Lemma 16: Consider a synchronous system with
general omission failures. Given a processor $p$,
its local state, and some fact $\varphi$ about the input
and faulty processors, determining whether or not
$K_p D_{\mathcal N}\varphi$ holds is NP-hard.

Proof: Here is the high-level reduction:

    for i = 2 to k do
        convert G to a system execution
        if $K_q D_{\mathcal N}\varphi$ then    /* no clique of size i */
            return(NO)
    return(YES)

The following describes the details of the conversion
and the construction of the fact $\varphi$ and then proves
that the overall reduction is correct.

Consider the conversion at iteration $i$. At this
point, it is certain that $G$ contains a clique of size
$i - 1$. Consider a system with $n = |V| + 2$ processors
with as many as $t = n - i$ general omission
failures. Thus, there must be at least $n - t = i$
nonfaulty processors. For each $v \in V$, there is a
corresponding processor $p_v$. In addition, there are
two processors $q$ and $r$. Let $\varphi$ be “r’s initial input
is 0.” The following execution of a full-information
protocol then occurs (only significant communication
is mentioned). Processor $r$’s initial input is 0.
In round 1, $q$ sends to all processors and $r$ sends to
none. All processors $p_v$ ($v \in V$) send to $q$ and $r$
and receive from $q$. If $(v,w) \in E$, then $p_v$ and $p_w$
communicate correctly in round 1. If $(v,w) \notin E$,
then no communication passes between $p_v$ and $p_w$
in round 1. In round 2, $q$ receives messages from
all processors. We claim that $K_q D_{\mathcal N}\varphi$ at time 2 if
and only if there is no clique in $G$ of size $i$.

Suppose that there is a clique in $G$ of size $i =
n - t$. Then, as far as $q$ can tell at time 2, the corresponding
processors might be the only nonfaulty
ones in the system, meaning that $q$ and $r$ are both
faulty. Since $q$ and $r$ are the only processors that
$q$ knows to have any knowledge about $r$’s input, it
cannot be that $q$ knows that $\varphi$ is distributed knowledge
to the nonfaulty processors. Now assume that
there is no clique in $G$ of size $i$ (recall that there
must be a clique of size $i - 1$). Then, at time 2,
$q$ knows that there are at most $i - 1$ correct processors
corresponding to vertices in the graph. It
knows that $r$ is faulty (as it sent to no processors in
round 1), so $q$ knows itself to be correct. Thus, since
$q$ knows $\varphi$, it knows that $\varphi$ is distributed knowledge
to the nonfaulty processors.  □
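
The shape of this Turing reduction can be sketched as follows. The sketch is illustrative only: the dictionary layout produced by `build_execution`, and all function names, are hypothetical, and the oracle `brute_force_oracle` is merely a stand-in for the test $K_q D_{\mathcal N}\varphi$. It decides the answer by exhaustive search, relying on the claim proved above that the test succeeds exactly when the graph has no clique of size $i$; it is not a real knowledge test.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Set


def build_execution(vertices: Set[str], edges: Set[FrozenSet[str]], i: int) -> Dict:
    """Iteration-i conversion: n = |V| + 2 processors, t = n - i failures,
    and round-1 links between p_v and p_w exactly when {v, w} is an edge."""
    n = len(vertices) + 2
    links = set()
    for edge in edges:
        v, w = sorted(edge)
        links.add(frozenset((f"p_{v}", f"p_{w}")))
    return {
        "n": n,
        "t": n - i,
        "processors": [f"p_{v}" for v in sorted(vertices)] + ["q", "r"],
        "round1_links": links,
        "fact": "r's initial input is 0",
    }


def brute_force_oracle(execution: Dict) -> bool:
    """Stand-in for the knowledge test (assumption): True iff no n - t of
    the graph processors communicated pairwise in round 1."""
    size = execution["n"] - execution["t"]
    graph_procs = [p for p in execution["processors"] if p not in ("q", "r")]
    links = execution["round1_links"]
    for group in combinations(graph_procs, size):
        if all(frozenset(pair) in links for pair in combinations(group, 2)):
            return False  # these processors could be the only nonfaulty ones
    return True


def has_clique(vertices: Set[str], edges: Set[FrozenSet[str]], k: int,
               oracle: Callable[[Dict], bool] = brute_force_oracle) -> bool:
    """The loop of the reduction: at most k - 1 oracle calls."""
    for i in range(2, k + 1):
        if oracle(build_execution(vertices, edges, i)):
            return False  # no clique of size i, hence none of size k
    return True
```

The loop makes at most $k - 1$ oracle calls and each conversion is polynomial in the size of the graph, so a polynomial-time knowledge test would yield a polynomial-time algorithm for clique.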

Similar reasoning can be used in asynchronous
systems with FIFO communication. The constructed
execution is longer because, in such systems,
a message subsequent to an omitted message
must arrive before any failure can be detected.

The NP-hardness of testing for distributed
knowledge implies NP-hardness for consistent
coordination:

Theorem 17: Consider a system with general
omission failures that is either synchronous or
asynchronous with FIFO communication. Processors
perform NP-hard local computation in any optimum
protocol for a consistent coordination problem.


7  Discussion  and  Conclusions 

This paper considered the problem of fault-tolerant
coordination in distributed systems. In some cases,
it was determined that such coordination was impossible.
In others, the computational complexity
of optimum coordination protocols was analyzed.
The main results are given in Table 1. Note that
we provide no impossibility results for general coordination;
furthermore, optimum protocols, when
they exist, are tractable. For consistent coordination,
in which faulty processors are forbidden from
taking inconsistent actions, the situation is much
more complex. This is especially true for general
omission failures. Here, consistent coordination is
impossible if as many as half the processors may
fail ($n \le 2t$); if a majority must remain correct
($n > 2t$), then the problems may be solved, but
any optimum protocol requires NP-hard local computation.
For asynchronous systems, the impossibility
result extends to systems with send-omission
failures. The complexity results, however, do not
hold unless message passing is FIFO.

Some of these results are related to earlier results
shown for simultaneous coordination [12,15]. However,
there are some important differences. One
of these is the case of general coordination in synchronous
systems with general omission failures and
$n > 2t$. For simultaneous coordination, optimum
solutions in these systems required NP-hard local
computation regardless of whether coordination
was general or consistent. In this paper, however,
we see that there are polynomial time solutions for
the nonsimultaneous general coordination but not
for consistent coordination. This is the first case
in the literature in which the computational complexity
of optimum solutions depends on the
general/consistent distinction.

Our results for coordination in asynchronous systems
apply to certain algorithms of Gopal and
Toueg [6]. For example, they present a solution to
a problem they call Single Value Agreement that
tolerates general omission failures when $n > 2t$. If
communication is not FIFO, then their solution is
optimal. If communication is FIFO, it is not; informally,
this is because the knowledge that communication
is ordered can allow processors to decide earlier.
However, doing so optimally requires processors
to perform NP-hard local computation. Gopal
and Toueg did not consider FIFO communication
and presented only protocols that used polynomial
computation.

Table 1: The Possibility and Complexity of Coordination

                                        General            Consistent
Synchronous
  Crash, Send-Omission                  polynomial time    polynomial time
  General Omission, n > 2t              polynomial time    NP-hard
  General Omission, n ≤ 2t              polynomial time    impossible
Asynchronous
  Crash                                 polynomial time    polynomial time
  Either omission, n > 2t, non-FIFO     polynomial time    polynomial time
  Send-omission, n > 2t, FIFO           polynomial time    polynomial time
  General Omission, n > 2t, FIFO        polynomial time    NP-hard
  Either omission, n ≤ 2t               polynomial time    impossible

The complexity results of Section 6 apply to optimum
solutions to coordination problems whose
enabling conditions are facts about the input. This
was done by relating knowledge of eventual common
knowledge to knowledge about distributed
knowledge. For asynchronous systems (without
FIFO communication), these results can be extended
to include facts about failures as well. We
are currently exploring ways in which such extensions
can be made for other systems. We have
already established a different relation between
knowledge of eventual common knowledge and of
distributed knowledge; we plan to extend the results
of this paper to coordination problems that
also depend on facts about failures.

Finally, recall that Moses and Tuttle [12] showed
that some coordination problems have no optimum
solution. In an earlier paper [13], we used a new
form of knowledge, called extended common knowledge,
to construct optimal solutions to any coordination
problem. In the future, we plan to explore
the complexity of this type of knowledge to better
understand the complexity of these optimal solutions
and thus complement the results presented
here for optimum solutions.
From Sequential Layers to Distributed Processes

Deriving  a  Distributed  Minimum  Weight  Spanning  Tree  Algorithm  * 

(extended  abstract) 
Abstract

Analysis and design of distributed algorithms and protocols
are difficult issues. An important cause for those
difficulties is the fact that the logical structure of the
solution is often invisible in the actual implementation.
We introduce a framework that allows for a formal treatment
of the design process, from an abstract initial design
to an implementation tailored to specific architectures.
A combination of algebraic and axiomatic techniques
is used to verify correctness of the derivation
steps. This is shown by deriving an implementation of
a distributed minimum weight spanning tree algorithm
in the style of [GHS].

1  Introduction 

Protocols for distributed systems can be complicated
to develop and even more complicated to understand
for anyone other than their designers. Such protocols
are often the result of a process of transforming, refining
and optimizing a basically simple algorithm. In order
to explain and clarify the final resulting protocol, as
opposed to mere verification, the structure of a correctness
proof should reflect the structure of the original design.
A notorious example is the algorithm for determining
minimum weight spanning trees by Gallager, Humblet
and Spira [GHS]. There are several published correctness
proofs of the [GHS] algorithm [WLL, CG, SR],
some of which rely on a protocol structure for the
verification process that differs from the structure of the
final algorithm. Yet we feel that these proofs fall short
of clarifying certain relevant aspects of the [GHS] algorithm.
In this paper we identify such aspects and we
show how each of them can be understood in a series
of relatively easy transformations where at each step
only a few new aspects are introduced. This leads to
a natural decomposition of our correctness proof that
has moreover the desirable property that it closely follows
a (possible) design trajectory. Explanation in the
form of systematic design allows for a comparison of
algorithms by means of a “genealogy”: the earlier during
the design that a different design decision was taken,
the more different the finally resulting algorithms are.
This genealogy often suggests other algorithms and
improvements. For the [GHS] algorithm we present a
design trajectory that starts with an initial solution from
which algorithms can be obtained as diverse as the Prim
and Dijkstra algorithms [Pri, Dijk], Kruskal’s algorithm
[Kru], Boruvka’s algorithm ([Bor, Tar]) and, indeed, the
algorithm of [GHS]. Already at a very early stage in our
design trajectory most of these, except Boruvka’s, are
excluded. We thus obtain a variant where essentially
the time complexity claimed by [GHS] is achieved.

The transformational design that we propose goes
from a sequential program (essentially Boruvka’s algorithm)
via a sequentially phased parallel program to a
distributed program. A sequentially phased parallel program
[SR] can be described as a sequential composition
of a number of layers [EF, KP], each of which is a
(relatively simple) parallel program. Many protocols for
distributed systems admit such a “layered” presentation
which is much easier to analyze than the final distributed
version. In [JPZ] a formulation of this principle
in the form of an algebraic transformation law has been
put forward.

In the present paper we apply this transformation law
and show that it can be applied to a situation as complex
as the GHS protocol. We do so by systematically deriving
a GHS-like protocol in a number of steps, starting
with a simple sequential Boruvka-like algorithm,
distributing it over nodes and introducing optimizations.
This leads to a correctness proof in a number of
relatively simple steps, reflecting decisions in the design
process.

Several other authors have also argued that the
correctness proof should be able to represent the
intuitive explanation given by the protocol designers. Chou
and Gafni [CG] group classes of actions and define a
sequential structure on such classes (so-called
stratification). In the actual verification, however, they use
singleton classes and a total order on actions which does
not comply with the abstract protocol structure.

Stomp and de Roever on the other hand [SR] introduce
what they call a principle for sequentially phased reasoning
which allows them to introduce semantically defined
layers that should correspond to the intuitive ideas of
the designers. In [Sto] this principle is applied to the
derivation of a broadcast protocol. The main difference
from our approach is that we use a formulation of this
principle in an algebraic setting.

Both  approaches  are  closely  related  to  the  idea  of 
Communication  Closed  Layers  by  Elrad  and  Francez. 

In order to get an idea of aspects of the GHS protocol
that we can explain we provide some detail of the
protocol as described in [GHS]. The protocol determines
the minimum weight spanning tree (MST) of a
given connected undirected graph with N nodes and E
edges. A connected subgraph of the MST is called a
fragment; virtually all algorithms for determining the
MST start with trivial fragments in the form of single
nodes and “grow” one or more fragments until the complete
MST has been obtained. The basic principle to
enlarge a fragment is to calculate its (uniquely determined)
minimum-weight outgoing edge; this edge can
be shown to be part of the MST. Two fragments can
be combined by connecting them via edge e if e is the
minimum weight outgoing edge of at least one of those
fragments. In [GHS] each fragment finds its
minimum-weight outgoing edge concurrently and asynchronously
with regard to other fragments, and then tries to combine
with the fragment at the other end of the edge
by sending a “connect” message. How and when this
combination takes place is quite intricate and must be
regarded as one of the typical characteristics of the GHS
protocol. It depends on so-called levels attached to fragments.
Apart from single nodes which are defined to be
at level 0, fragments F have a level L > 0 which,
according to [GHS], depends on previous combinations.
We quote [GHS]:

“Suppose a given fragment F is at level L > 0
and the fragment F' at the other end of F's
minimum-weight outgoing edge is at level L'.
If L < L', then fragment F is immediately
absorbed as part of fragment F', and the expanded
fragment is at level L'. If L = L' and
fragments F and F' have the same minimum-weight
outgoing edge, then the fragments combine
immediately into a new fragment at level
L + 1. In all other cases, fragment F simply
waits until fragment F' reaches a high enough
level for combination under the above rules.”
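
The three cases of the quoted rule can be written as a small decision function. The following Python sketch is purely illustrative: the names `Outcome` and `combine` are hypothetical and do not appear in [GHS], and the function only classifies a connection attempt without modelling any message passing.

```python
from enum import Enum
from typing import Optional, Tuple


class Outcome(Enum):
    ABSORB = "absorb"  # F is absorbed into F'
    MERGE = "merge"    # F and F' combine into a new fragment
    WAIT = "wait"      # F waits until F' reaches a high enough level


def combine(level_f: int, level_f_prime: int,
            same_min_edge: bool) -> Tuple[Outcome, Optional[int]]:
    """Return the outcome and the level of the resulting fragment (if any)
    when F tries to connect to F' over F's minimum-weight outgoing edge."""
    if level_f < level_f_prime:
        return Outcome.ABSORB, level_f_prime
    if level_f == level_f_prime and same_min_edge:
        return Outcome.MERGE, level_f + 1
    return Outcome.WAIT, None
```

For example, `combine(1, 3, False)` yields an absorption at level 3, whereas `combine(2, 2, False)` makes fragment F wait.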

One important reason why it is difficult to get a clear
intuitive understanding of the protocol is that various
fragments with totally different levels are active at the
same time. Our analysis is based on a clear distinction
between causal order and temporal order of events. It
is shown that the apparent “chaos” from the temporal
point of view corresponds to a highly regular pattern
from a causal order point of view. Actually, in terms
of causal order the protocol is closely related to Boruvka’s
algorithm, where there is a strict alternation between
phases where the minimum weight outgoing edges
of all fragments are determined and phases where fragments
combine until no further combination is possible.
Conceptually, i.e. from a causal order point
of view, all fragments of a given level L are created
together, in a single phase as sketched. From a temporal
point of view though, the creation of fragments,
by means of creating a ‘core’ followed by ‘absorbing’
other fragments, is spread out over time. Actually
it is quite possible for a given fragment to
become part of some higher level fragment
before the fragment itself is completed! Within this setting
it now becomes possible to clarify an aspect such
as the necessity of tagging messages with level numbers
in [GHS]: in the intermediate stages of our design trajectory
there are no such tags or any other (explicit)
references to level numbers within the program itself.
However, in order to apply our communication closed
layers law we introduce separate sets of communication
channels, one for each level, so as to fulfill the side conditions
for the law. After applying the law we have obtained
a distributed algorithm with still the same sets of
channels. In one of the last transformation steps we then
merge the channels for different levels between two given
nodes, by a straightforward multiplexing technique.
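
The strict alternation of phases in Boruvka’s algorithm mentioned above can be illustrated by a short sequential sketch. It is not the derivation of this paper: it is a textbook-style Python version, the name `boruvka_mst` and the `(weight, u, v)` edge encoding are hypothetical, and edge weights are assumed to be distinct so that every fragment has a uniquely determined minimum-weight outgoing edge.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, str, str]  # (weight, u, v); weights assumed distinct


def boruvka_mst(nodes: List[str], edges: List[Edge]) -> List[Edge]:
    """Alternate between (1) finding the minimum-weight outgoing edge of
    every fragment and (2) combining fragments along those edges."""
    fragment: Dict[str, str] = {v: v for v in nodes}  # union-find parents

    def find(v: str) -> str:
        while fragment[v] != v:
            fragment[v] = fragment[fragment[v]]  # path halving
            v = fragment[v]
        return v

    mst: List[Edge] = []
    while True:
        # Phase 1: minimum-weight outgoing edge per fragment.
        best: Dict[str, Edge] = {}
        for edge in edges:
            _, u, v = edge
            fu, fv = find(u), find(v)
            if fu == fv:
                continue  # not an outgoing edge
            for frag in (fu, fv):
                if frag not in best or edge < best[frag]:
                    best[frag] = edge
        if not best:
            return mst  # no outgoing edges left
        # Phase 2: combine fragments until no further combination is possible.
        for w, u, v in best.values():
            fu, fv = find(u), find(v)
            if fu != fv:
                fragment[fu] = fv
                mst.append((w, u, v))
```

Each pass through the loop corresponds to one level in the causal view: every fragment that exists at the start of the pass takes part in at least one combination.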

The framework we introduce in this paper allows us to
formulate principles like communication closed layers in
a compositional and algebraic setting. The formulation
of such laws strongly depends upon a new type of composition
operator called layer composition. It resembles
sequential composition but allows more parallelism
between actions. The definition of this operator is given
in a partial order model, but does not depend on that.
It relies on a symmetric and irreflexive relation between
actions, called the conflict relation, akin to conflicts in
distributed databases [BHG].

The introduction of such operators yields a language
that allows us to follow the whole trajectory starting
with an initial design that is free of architectural bias
to the actual physical implementation where considerations
like the number of processors or nodes and the network
structure come into play. The derivation steps in this trajectory
cannot be done by using algebraic laws only. At some
moments in the derivation process state-based and axiomatic
reasoning is needed to show the correctness. Only by
combining both styles of reasoning can we bridge
the gap between abstract specification and low-level
implementation.

Before giving the derivation we first introduce the
language used and the underlying model.

2  The  design  language  and  its 
properties 

In this section we present the language used in the
derivation process and some of the properties needed.
In order to get some intuition for the validity of these
laws and properties we informally describe the underlying
model, based on runs consisting of a set of events and
a causal order and a temporal ordering relation. For a
more detailed treatment of the model and the algebraic
properties of our language we refer to [JPZ, Zwi].

The language that we use is intended to be appropriate
for the initial design stage - during which we prefer
to have no bias towards a certain network architecture
- as well as for the description of the final program
that should fit the network structure. The reason for
having a single language rather than two separate
languages, one for initial design and another for coding the
final program, is that we aim at a gradual transformation
from initial design towards final implementation,
which requires a single language that can represent all
stages, including intermediate ones. Since we introduce
some rather unconventional language operators, which
are difficult to appreciate without a basic knowledge of
the underlying model, we start with a sketch of the latter.

The  model 

Basically, we describe the execution of distributed systems
by histories h that consist of a partially ordered set
of events. This model is related to the pomset model as
introduced in [Pratt]. Typical examples of events that
we actually use in this paper include send and receive
actions and read and write operations to local or shared
memory. The precise interpretation of an event e is
determined by its attributes o(e), some of which will be
mentioned below. For each system many different histories
are possible, due to different behavior of the concurrent
environment of the system and other causes of
nondeterminism. Therefore a system semantically denotes
a set of possible histories.


Events e and f that are unordered in some history
h are said to be independent. Potentially such events
execute in parallel, i.e. at the same time or at overlapping
time intervals. Within our design formalism there
are two causes for ordering events which consequently
do not execute in parallel:

•  The first one is because e and f conflict in the sense
that they both access a common resource that does
not allow simultaneous access. The generic example
(and the terminology) stems from conflicts between
concurrent database actions [BHG] due to
read and write operations to the same shared memory
locations. When this happens e and f simply
cannot execute (fully) in parallel and so must logically
be ordered, which we denote either as $e \to f$
or as $f \to e$, depending on which is the case. Only
conflicting actions are ordered logically.

•  The second cause is that actions are temporally ordered
as the result of the use of language operators
that explicitly require such ordering. Such operators
are typically used in the last design stage where
actions are actually allocated on specific processors,
or to specific network nodes. Clearly, actions that
should run on a single processor have to be ordered
temporally. Temporal precedence of e over f
is denoted by $e \twoheadrightarrow f$.

Because of the sharp difference between logical and temporal
precedence, conceptually from the point of view of a designer
as well as from a more technical point of view, we use a formal
semantic model where histories are structures of the form
(E, →, ↠), where E is a set of events with a dual ordering
defined on it: (E, →) is a directed acyclic graph (DAG), i.e.,
the transitive closure of → is a partial order on E, and
(E, ↠) is simply a partial order itself. The two ordering
relations are weakly related. Temporal order obviously does not
imply logical precedence. If two events e and f are logically
ordered, say e → f, then they cannot be ordered temporally in
the reversed direction, i.e.

e → f  implies  ¬(f ↠ e)

Also the following relation must hold:

e ↠ f → g ↠ h  implies  e ↠ h
Informally one can think of e → f as e influencing f, which
cannot be the case if f completely precedes e in time. No
stronger relationship can be assumed; for instance, from
database serializability theory there are well-known examples
of atomic transactions e, f and g such that e → f → g, yet
g ↠ e!
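
The constraints on a history can be checked mechanically on small examples. The following Python sketch is only an illustration: the function names are hypothetical, the relations are given as sets of pairs, and the mixed rule is read as "e temporally before f, f logically before g, g temporally before h implies e temporally before h", which is an assumption about the intended formula.

```python
from itertools import product


def transitive_closure(pairs):
    """Transitive closure of a binary relation given as a set of pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def is_history(events, logical, temporal):
    """Check the well-formedness constraints of a history (E, logical, temporal)."""
    logical, temporal = set(logical), set(temporal)
    # The logical precedence graph must be acyclic.
    logical_plus = transitive_closure(logical)
    if any((e, e) in logical_plus for e in events):
        return False
    # Temporal precedence must be a (strict) partial order.
    if any((e, e) in temporal for e in events):
        return False
    if transitive_closure(temporal) != temporal:
        return False
    # Logical precedence e -> f forbids temporal precedence of f over e.
    if any((f, e) in temporal for (e, f) in logical):
        return False
    # Mixed rule (assumed reading): temporal, logical, temporal chains compose.
    for (e, f), (f2, g), (g2, h) in product(temporal, logical, temporal):
        if f == f2 and g == g2 and (e, h) not in temporal:
            return False
    return True
```

Under these assumptions the serializability example mentioned above, a logical chain from e via f to g with g temporally preceding e, passes the check, whereas a direct reversal of a single logical precedence does not.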

Informal semantics and algebraic properties

The two main composition operators of the language, parallel
composition and conflict composition, are defined purely in
terms of logical precedence, i.e. no temporal order is enforced
by these operators.

The histories for a parallel composed system S of the form
Q ∥ R can be described as follows. The events executed by S in
some history h can be partitioned into subhistories h_Q and h_R
that are possible histories for Q and R. Moreover, the logical
precedence relation between h_R and h_Q is such that all
conflicting events are logically ordered, where the direction of
each precedence is nondeterministically chosen. This
nondeterminism is constrained, of course, by the fact that
logical precedence is an order, so cycles of the form
e0 → e1 → ⋯ → e0 are not allowed.

Layer composition can be considered an asymmetric form of
parallel composition. For Q ∥ R the logical precedence between
conflicting actions of Q and R is nondeterministically
determined. For layer composition of Q and R, which is denoted
by Q • R, actions from Q take logical precedence over actions
from R in case of conflicts. In the case of independent actions
no order is enforced, however, just as is the case for parallel
composition. We also use iterated layer composition S®,
analogously to iterated sequential composition S*.
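
The difference between the orderings induced by layer composition and by sequential composition can be sketched as follows. This is a minimal Python illustration under stated assumptions: actions are modeled as dictionaries with hypothetical fields `name`, `reads` and `writes`, and two actions conflict when one writes a variable that the other reads or writes.

```python
def conflicts(a, b):
    """Two actions conflict if one writes a variable the other reads or writes."""
    a_touches = a["reads"] | a["writes"]
    b_touches = b["reads"] | b["writes"]
    return bool(a["writes"] & b_touches) or bool(b["writes"] & a_touches)


def layer_precedence(q_actions, r_actions):
    """Logical precedences induced by layer composition of Q and R:
    only conflicting pairs are ordered, always from Q to R."""
    return {(q["name"], r["name"])
            for q in q_actions for r in r_actions if conflicts(q, r)}


def sequential_precedence(q_actions, r_actions):
    """Temporal precedences induced by sequential composition Q ; R:
    every Q action precedes every R action, regardless of conflicts."""
    return {(q["name"], r["name"]) for q in q_actions for r in r_actions}
```

For a layer Q that writes x and y and a layer R that reads x and writes two other variables, only the pair of actions sharing x is ordered by layer composition, while sequential composition orders all four pairs.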

Layer composition should be compared with sequential
composition of the form Q ; R. This is somewhat like layer
composition, except that we also enforce temporal ordering
between Q and R actions: all Q actions temporally precede all R
actions, regardless of conflicts. So whereas conflict
composition admits parallel execution of certain actions,
sequential composition does not. A sharp difference between the
two forms of composition shows up when we consider Elrad and
Francez' “communication closed layers” [EF]. The essence of
communication closed layers is that under certain conditions a
parallel system S ∥ T, where S and T are sequential programs of
the form S0 ; S1 and T0 ; T1, is “equivalent” to a sequential
composition of “layers” S0 ∥ T0 and S1 ∥ T1, thus:

(S0 ; S1) ∥ (T0 ; T1)  =  (S0 ∥ T0) ; (S1 ∥ T1)      (*)

The side condition is that there is no communication, or in our
parlance no conflict, between actions from S0 and T1, nor should
there be conflicts between actions from S1 and T0. Generalized
forms of this principle appear also in [SR]. The equivalence
used in (*) is sometimes called IO-equivalence, referring to the
fact that although the histories of the left-hand and right-hand
sides of (*) are not the same, the relation between initial and
final states of the system is the same nevertheless. A problem
with this equivalence is that it is not a congruence, so we
cannot simply interchange the left- and right-hand sides of (*)
within contexts! Within our framework, however, we can replace
the sequential composition in (*) by conflict composition,
resulting in the following algebraic law, given for the case of
two layers consisting of two parallel components (with the same
side conditions as for (*)):

(S0 • S1) ∥ (T0 • T1)  =  (S0 ∥ T0) • (S1 ∥ T1)      (CCL)

Note  that  we  not  only  have  a  congruence,  but  even 
semantic  equality  here,  which  is  to  be  understood  as 
the  fact  that  both  sides  of  the  equation  admit  exactly 
the  same  partial  order  based  histories.  We  also  use  a 
number of derived laws, see [Zwi]. A special case is the
well-known  independence  law: 

If  P  and  Q  are  non-conflicting,  then 

P • Q  =  P ∥ Q
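
The intuition behind the independence law can be illustrated on the level of final states. The sketch below is an assumption-laden simplification: processes are reduced to single atomic state updates (hypothetical functions `inc_x`, `double_y`, `copy_x_to_y`), and only the set of reachable final states over all execution orders is compared, not the partial order histories themselves.

```python
from itertools import permutations


def final_states(actions, state):
    """Return the set of final states reachable by running the atomic
    actions in every possible order from the given initial state."""
    results = set()
    for order in permutations(actions):
        s = dict(state)
        for act in order:
            act(s)
        results.add(tuple(sorted(s.items())))
    return results


def inc_x(s):
    s["x"] = s["x"] + 1      # reads and writes x only


def double_y(s):
    s["y"] = 2 * s["y"]      # reads and writes y only


def copy_x_to_y(s):
    s["y"] = s["x"]          # reads x, writes y: conflicts with inc_x
```

The non-conflicting pair `inc_x`, `double_y` yields one final state whatever the order, so no ordering needs to be imposed. The conflicting pair `inc_x`, `copy_x_to_y` yields two, which is why conflicting actions must be logically ordered.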

The  process  term  io(P)  denotes  execution  of  a  sin¬ 
gle  action  that  captures  the  net  effect  of  executing  P 
without  admitting  interference  by  other  events.  The 
io(  )  operation  is  also  called  the  contraction  operation, 
since  it  contracts  complete  P  runs  into  single  events. 
Intuitively  io(P)  represents  the  input-output  behavior 
of  a  process  P  if  we  execute  that  process  in  isolation, 
i.e.  without  interference  from  outside.  This  operation 
induces an interesting process equivalence, called
IO-equivalence, and an associated IO-refinement relation.
Such equivalences play an important role in the book by Apt and
Olderog [AO].

P =_IO Q   iff   io(P) = io(Q)

Specification of what is often called the functional behavior
of a process P is really a specification of io(P), i.e. of the
IO-equivalence class of P. The io(·) operation does (obviously)
not distribute through parallel composition. For the case of
layer composition we have the following law:

P • Q  =_IO  io(P) • io(Q)

The intuition here is that although execution of “layer” P
might overlap execution of “layer” Q temporally, one can
pretend that all of P, here represented as an atomic action
io(P), precedes all of Q as far as IO-behavior is concerned.

IO-behavior of a system can also be specified by means of
classical pre- and postconditions. We interpret a Hoare style
formula of the form

{pre} S {post}      (**)

where pre and post are state formulae as usual, as follows. For
each history h in io(S) let s0(h) and s(h) denote the initial
and final state of the (unique) S event in h. Then (**) requires
that if the initial state s0(h) satisfies precondition pre, the
corresponding final state s(h) satisfies the postcondition post.
Hoare style program verification for concurrent systems is more
complicated than verification of sequential programs due to the
possibility of interference. The classical proof system for
shared variables by Owicki and Gries [OG], for instance,
includes extra interference freedom checks for assertions used
in proof outlines. It has been shown by Apt and Olderog [AO],
however, that for restricted cases it is possible to verify
parallel programs relying on techniques for sequential programs.
This work relies on classical Hoare style verification in
combination with program transformation based on IO-equivalence.
We use similar techniques in the derivation of the algorithm,
where we exploit the fact that conflict composition, although it
does admit parallelism, behaves just like sequential composition
when we apply the io(·) operation! This follows from the fact
that the contraction of some history h can be determined without
taking temporal ordering into account; logical precedence as
such is sufficient to determine the cumulative state
transformation associated with h.

This implies that to verify a pre-post specification for a
program of the form S • T it suffices to verify the associated
sequential program S ; T.

In  the  derivation  we  use  the  combination  of  Hoare 
style  formulas  and  program  transformation  to  guarantee 
the  correctness  of  some  transformation  steps.  This  can 
be seen as proof outline transformation, in the style of
Reynolds [Rey]. For example, we have the following rule for
iterated layer composition.

Define  layer l : P(l) until B  ≝  (P» S)®.  If P(l) is of the
form

P(l) = for v ∈ V dopar P(v)(l) rof ,

and B is of the form

B = ∀v ∈ V (B(v))

and if furthermore the following premises are satisfied.
Fori 

P(v)(l) does not conflict with P(v')(l')

and

{B} P {B ∨ (∀v ∈ V (¬B(v)))}

then

layer l
    for v ∈ V dopar P(v)(l) rof
until B

= { CCL-iteration }

for v ∈ V dopar
    layer l : P(v)(l) until B
rof

Informally, the last premise states that all parallel
components must stop at the same number of iterations.

We  conclude  this  section  with  a  somewhat  more  de¬ 
tailed  description  of  the  shared  memory  model  and  the 
communication  mechanism  used  in  the  description  of 
the  algorithm. 

Shared  memory  and  communication 

In  our  model  the  basic  actions  are  guarded  assignments 
of  the  form 

b & x1, x2, … := exp1, exp2, …

Informally, such an assignment is postponed until the guard b
holds, whereafter the values of the expressions exp_i are
assigned simultaneously to all x_i. So our guarded assignments
are really limited forms of the well-known await statement. If
the guard is true it is omitted.


We assume that there exists a given conflict relation between
actions; for example, two actions conflict if one writes into a
variable that the other action reads or writes. We could also
assume read-read conflicts, but will not do so in this paper. At
later stages we also use communication via channels. We can
model unidirectional, asynchronous channels by shared variables.
A channel c can be defined as a pair (c.flag, c.val), where
c.flag is a boolean that is true iff a value is available on the
channel, and c.val is the value to be read. Send and receive
actions can now be modeled as guarded assignments. The channel
name c of send and receive actions is a triple given by the node
emitting the message, the node receiving it, and a name. Let
c = (u, v, Msg):
send(u)(v)(Msg(e))  ≝
    ¬c.flag & c.flag, c.val := true, e

receive(u)(v)(Msg(x))  ≝
    c.flag & c.flag, x := false, c.val
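
The one-place channel above can be mimicked directly in an ordinary programming language. The following Python sketch is only an illustration under stated assumptions: the names `Channel`, `try_send` and `try_receive` are hypothetical, and instead of postponing an action until its guard holds, each function simply reports whether the guard was enabled.

```python
class Channel:
    """One-place channel modeled by two shared variables."""

    def __init__(self):
        self.flag = False   # true iff a value is available on the channel
        self.val = None     # the value to be read


def try_send(c, e):
    """Guarded assignment 'not c.flag & c.flag, c.val := true, e'.
    Returns True if the guard held and the assignment was executed."""
    if c.flag:
        return False
    c.flag, c.val = True, e
    return True


def try_receive(c):
    """Guarded assignment 'c.flag & c.flag, x := false, c.val'.
    Returns (True, x) if the guard held, otherwise (False, None)."""
    if not c.flag:
        return (False, None)
    c.flag, x = False, c.val
    return (True, x)
```

A second send on a full channel and a receive on an empty channel are both disabled, which is exactly the blocking behavior expressed by the two guards.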
We  can  take  a  more  liberal  view,  where  we  have 
buffered  channels,  which  is  needed  in  the  final  stages 
of  the  derivation.  We  do  not  present  a  full  syntax  of  the 
language  used  in  the  derivation.  The  operators  used 
are  straightforward  abbreviations  of  expressions  using 
the  operators  given  above. 

3  Derivation  of  the  algorithm 

As we explained in the introduction, the derivation follows a
number of phases, starting off with a simple and easy to prove
sequential program and finally arriving at a distributed and
partially optimized set of processes. In this section we give an
outline of the total derivation process and exemplify a number
of crucial steps in the development. The derivation is presented
as a top-down structured process. This does not correspond to
the true derivation process, as both the initial design and the
final implementation were known beforehand. The derivation given
is the result of closing the gap from both sides, eventually
resulting in a clear derivation showing the correctness of the
distributed implementation. The final result of the derivation
follows the GHS protocol closely, but has some improvements from
the point of view of top-down design of programs. Furthermore,
not all optimizations are introduced. See [JZa] for a derivation
of the whole protocol.

In  the  derivation  we  can  distinguish  a  number  of  dif¬ 
ferent  stages,  each  given  by  a  number  of  relatively  sim¬ 
ple  transformation  steps: 

1.  The  initial  (sequentially  structured)  design 

2.  Distributing  data 

3.  Recursively computing the minimum weight outgoing edge

4.  Synchronization and information diffusion by means of
message passing

5.  Applying  the  Communication  Closed  Layers  law  to 
get  a  distributed  implementation 

6.  Multiplexing  channels 

7.  Optimizing  the  algorithm 

Of  all  these  steps,  the  application  of  CCL  is  a  purely 
algebraic  one,  although  it  requires  some  non-algebraic 
transformations in order to satisfy the premises of
CCL.  Other  parts  of  the  derivation  are  proven  in  an 
axiomatic  way  or  as  a  combination  of  both  strategies. 
We  will  emphasize  the  first  few  steps  and  the  application 
of  the  CCL  laws. 

The  initial  design  closely  follows  the  algorithm  intro¬ 
duced  by  Boruvka  which  can  be  found  in  [Tar].  As  it 
is  a  sequentially  structured  algorithm  its  correctness  can 
be  shown  using  classical  Hoare  style  techniques  [Lam]. 
It  is  not  purely  sequential  however  as  it  is  formulated 
using layer composition instead of sequential composition. This
is to allow the program to be transformed and distributed using
the algebraic framework we introduced. However, the overall
behavior of this system can
be  viewed  as  if  it  executes  sequentially.  According  to 
our  viewpoint  the  use  of  sequential  composition  should 
be  restricted  to  those  cases  when  one  really  means  to 
introduce  a  temporal  relation  instead  of  a  causal  rela¬ 
tionship.  In  an  initial  design  this  is  hardly  ever  the 
case,  as  no  architectural  decisions  have  been  taken  into 
account  yet. 

Before  describing  this  algorithm  we  introduce  some 
notation  and  theorems  on  minimum  weight  spanning 
trees. 

3.1  Spanning  trees  and  fragments 

Assume we have a given connected and undirected graph
G = (V, E). We assume every edge i has a distinct weight w(i).
We assume all nodes have distinct names and are totally ordered.
In the following let u, v, x and y denote vertices, and let
i, j, k, ... denote edges. Edges are also denoted by two-element
sets {u, v}. For a graph G the following theorem holds:
Theorem  3.1 

If G is a connected graph where every edge has a distinct
weight, the minimum weight spanning tree MST(G) is uniquely
determined.  □

The  proof  can  be  found  in  [GHS]. 

For any node v ∈ V let inc(v) denote the set of edges incident
to v, i.e.

inc(v) ≝ { {u, v} ∈ E }

For an edge j = {u, v} let the destination of j with respect to
u, dest(u)(j), be v. We also use the source or destination of an
edge w.r.t. a fragment or set of nodes, e.g. for fragment f and
edge j = {u, v} such that v ∉ f and u ∈ f we have that
src(f)(j) = u.

A fragment is a connected subgraph of MST(G). For any fragment
f let μ(f) be the minimum weight outgoing edge of f.

The  basic  idea  of  the  algorithm  follows  from  the  fol¬ 
lowing  lemma,  which  is  proven  in  [GHS]. 

Lemma  3.2 

Let G = (V, E) be a connected graph where every edge has a
distinct weight, and let f be a fragment of MST(G). If k is the
minimum weight outgoing edge of f then joining k and its
adjacent nonfragment node to f yields another fragment of
MST(G).  □

In  the  same  way  we  can  combine  two  fragments  with 
a  connecting  minimum  weight  outgoing  edge  into  a  new 
fragment. 

The algorithm we introduce in the next section is based on
Boruvka's algorithm [Bor, Tar]. The rough idea is as follows. We
compute a set of fragments frag by iteratively combining
fragments and their minimum weight outgoing edges. Initially
frag is the set of all fragments that consist of a single node
and no edges (which is a fragment by definition). Then every
fragment determines its minimum weight outgoing edge and
combines with the fragment on the other side of the edge. If two
fragments share the same minimum weight outgoing edge j, then j
is called the core of the newly formed fragment. The node
adjacent to the core with the least name is called the core
node. ^ If a fragment is not combined via a core to another
fragment it is said to be absorbed.

The algorithm terminates when we only have a single fragment
left, MST(G).

Every  fragment  in  this  algorithm  consists  of  a  core 
node  that  is  the  root  of  the  tree  consisting  of  all  other 
branches  and  other  nodes  in  the  fragment.  This  tree 
structure  is  used  to  gather  information  in  the  tree  or  to 
broadcast  information. 

In the derivation we also need the following lemma. One of the
characteristic features of the GHS protocol - postponement of
absorption - is partially based on this lemma.

Lemma  3.3 

Let f be a fragment and let j be an edge of f. Removing j (but
not its endpoints) from f disconnects f in two disjoint trees,
at least one of which - say f1 - does not contain the core of f.
(If j is the fragment core any of the two subtrees can be
taken.) We then have that μ(f1) = j.

Proof:  see  [JZa]  □ 

^This  is  different  from  [GHS]  where  both  nodes  adjacent  to  a 
core  play  equivalent  roles.  In  the  top-down  design  of  the  algorithm 
the  choice  for  a  single  core  node  is  more  straightforward  and  leads 
to  more  elegant  solutions,  without  essentially  changing  the  ideas 
of  GHS.  We  therefore  take  this  choice. 
In the rest of this paper we furthermore use the following
operations on graphs and trees. Let G = (V, E) and
H = (V', E') be graphs. We now define the union of G and H as

(V ∪ V', E ∪ E')

For node v and edge k let

v ∈ G ≝ v ∈ V   and   k ∈ G ≝ k ∈ E

and

G ∪ {v} ≝ (V ∪ {v}, E),

G ∪ {k} ≝ (V, E ∪ {k}).

Furthermore, for any fragment f let inc(f) be the set of edges
leaving f (i.e. the set of all {u, v} such that u ∈ f and
v ∉ f). For a node v we define out(f)(v) as the set of edges
incident to v leaving f, i.e.

out(f)(v) ≝ inc(v) ∩ inc(f)

If f is clear from the context it is omitted.

We define the minimum weight edge of a non-empty set E,
min-edge(E), as

min-edge(E) ≝ e ∈ E such that
    w(e) = MIN{w(e') | e' ∈ E}

and define min-edge(∅) ≝ nil. We take w(nil) ≝ ∞. Finally, for a
graph G = (V, E), let concomp(G) be the set of connected
components in G, i.e. the set of maximal and connected graphs in
G.
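
These auxiliary operations translate almost literally into code. The Python sketch below is an illustration under stated assumptions: edges are represented as `frozenset({u, v})`, weights as a dictionary `w`, nil as `None`, and `concomp` returns only the node sets of the components rather than full subgraphs. All names are hypothetical.

```python
import math

NIL = None  # stands for the value nil


def weight(edge, w):
    """w(nil) is taken to be infinity."""
    return math.inf if edge is NIL else w[edge]


def min_edge(edges, w):
    """Minimum weight edge of a set of edges, or nil for the empty set."""
    return min(edges, key=lambda e: w[e]) if edges else NIL


def inc(fragment_nodes, E):
    """Edges leaving a fragment: exactly one endpoint inside it."""
    return {e for e in E if len(e & fragment_nodes) == 1}


def concomp(V, B):
    """Node sets of the connected components of the graph (V, B)."""
    comp = {v: frozenset([v]) for v in V}
    for e in B:
        u, v = tuple(e)
        merged = comp[u] | comp[v]
        for x in merged:
            comp[x] = merged
    return set(comp.values())
```

With distinct weights `min_edge` is deterministic, which matches the assumption made on the graph G.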

3.2  The  initial  design 

The  first  implementation  is  based  on  the  construction 
of  a  set  of  fragments  frag  which  determine  their  min¬ 
imum  weight  outgoing  edges  and  combine  connected 
fragments.  Initially  the  set  frag  consists  of  all  trees  con¬ 
sisting  of  a  single  node  and  no  edges,  which  are  by  def¬ 
inition  fragments.  Furthermore  we  compute  the  set  of 
edges  B  that  are  branches,  i.e.  that  are  part  of  the  span¬ 
ning  tree.  This  is  the  basic  idea  of  Boruvka’s  algorithm. 
It  can  be  described  as: 

MST0 ≝
    frag := concomp((V, B)) •
    layer
        M := {min-edge(inc(f)) | f ∈ frag, inc(f) ≠ ∅} •
        B := B ∪ M •
        frag := concomp((V, B))
    until M = ∅
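
A sequential rendering of this initial design may help to fix ideas. The following Python sketch is not the layered program itself but an ordinary loop with the same state changes, under stated assumptions: B starts empty, the graph is connected with distinct weights, edges are `frozenset({u, v})` keys of a weight dictionary `w`, and the helper `components` and the round counter are additions for illustration.

```python
def components(V, B):
    """Node sets of the connected components of (V, B)."""
    comp = {v: frozenset([v]) for v in V}
    for e in B:
        u, v = tuple(e)
        merged = comp[u] | comp[v]
        for x in merged:
            comp[x] = merged
    return set(comp.values())


def boruvka(V, w):
    """Return the branch set B and the number of merging rounds."""
    E = set(w)
    B = set()          # assumed initial value of B
    rounds = 0
    while True:
        frag = components(V, B)
        # M: the minimum weight outgoing edge of every fragment that has one.
        M = set()
        for f in frag:
            outgoing = [e for e in E if len(e & f) == 1]
            if outgoing:
                M.add(min(outgoing, key=w.get))
        if not M:          # until M is empty
            return B, rounds
        B |= M
        rounds += 1
```

On a path with four nodes whose middle edge is the heaviest, the sketch needs two merging rounds: the number of fragments goes from four to two to one.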

The total correctness of this algorithm follows from loop
invariant I0, the definition of μ(f), and bound function τ, that
are defined as:

I0 :  |frag| ≥ 1 ∧
      ∀f ∈ frag (f is a subtree of MST(G)) ∧
      {V' | (V', E') ∈ frag} is a partitioning of V

and

τ ≝ log(|frag|)

The invariance follows from the initialization and Lemma 3.2.
The number of fragments |frag| is at least divided by two in
each iteration and therefore log(|frag|) decreases. From the
invariant and the termination condition the postcondition

P0 :  frag = {MST(G)}

easily follows.

Although  we  take  MSTq  as  the  initial  design  in  our 
trajectory,  it  is  also  possible  to  give  an  even  more  gen¬ 
eral  algorithm  that  comprises  the  Prim-Dijkstra  and 
Kruskal  algorithms,  by  not  adding  M  to  B,  but  only 
adding a subset of M to B. In that case, however, we lose the
logarithmic complexity of the algorithm, as the number of
fragments decreases, but is not necessarily divided by two.

As a second step we split the computation of M into the layered
computation of the minimum weight outgoing edge for every
fragment. The reason for doing so is that we want to distribute
data, first per fragment and eventually per node. We do so by
introducing a variable mo(f) for every fragment f. This is an
instance of straightforward top-down design and Hoare style
verification for sequential programs. The following
representation function for M will hold:

M(mo) ≝ {mo(f) | f ∈ frag, inc(f) ≠ ∅}

This leads to the following refined program:

MST1 ≝
    B := ∅ •
    frag := concomp((V, B)) •
    layer
        for f ∈ frag layer
            if inc(f) ≠ ∅ then
                mo(f) := min-edge(inc(f)) •
                B := B ∪ {mo(f)}
            else mo(f) := nil
            fi
        rof •
        frag := concomp((V, B))
    until ∧{mo(f) = nil | f ∈ frag}

The  correctness  of  this  transformation  step  can  be 
proven  by  means  of  the  representation  function  M  and 
the  structure  of  the  conditional  statement  that  implies 
that 

mo(f) = nil  iff  inc(f) = ∅

as inc(f) ≠ ∅ implies min-edge(inc(f)) ≠ nil.

3.3  Distributing  data 

The transformation of MST1 to a program where all data are
distributed takes a number of steps. First we will distribute B
by introducing variables SE for every edge. This allows for the
introduction of parallelism between the different fragments, as
conflicts due to the access to B are resolved.

Thereafter we introduce variables lmo(v) giving the local
minimum weight outgoing edges of every node of every fragment.
Finally we introduce a core node for every fragment, by defining
boolean variables core(v) for every node v, such that for every
fragment there is a single node in the fragment with core(v).

We first introduce variables

SE(u)(v) ∈ {branch, basic}

for every {u, v} ∈ E. The following representation function
B(SE) will hold:

B(SE) ≝ { {u, v} ∈ E | SE(u)(v) = branch }

The transformation consists of adding initialization of every
SE(u)(v) and of replacing

B := B ∪ {mo(f)}

by

SE(src(f)(mo(f)))(dest(f)(mo(f))) := branch

The guard inc(f) ≠ ∅ can also be replaced by mo(f) ≠ nil after
computing mo(f), as min-edge(∅) = nil.
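
The data refinement from the set B to the edge states SE can be pictured as follows. This Python sketch is a hedged illustration: SE is stored as a dictionary keyed by ordered node pairs, the names `init_SE`, `mark_branch` and `B_of` are hypothetical, and, since {u, v} is an unordered pair, the representation function is read here as counting an edge as a branch when either direction is marked, which is an assumption.

```python
def init_SE(E):
    """Initialize both directions of every edge to 'basic'."""
    SE = {}
    for e in E:
        u, v = tuple(e)
        SE[(u, v)] = "basic"
        SE[(v, u)] = "basic"
    return SE


def mark_branch(SE, src, dest):
    """Concrete counterpart of adding the edge {src, dest} to B."""
    SE[(src, dest)] = "branch"


def B_of(SE, E):
    """Representation function B(SE): the abstract branch set."""
    result = set()
    for e in E:
        u, v = tuple(e)
        if SE[(u, v)] == "branch" or SE[(v, u)] == "branch":
            result.add(e)
    return result
```

Because each fragment now writes only the SE entry at the source node of its own minimum weight outgoing edge, two fragments no longer write the same shared variable B.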

The  correctness  of  this  step  can  be  proven  by  VDM 
style  data  refinement  with  atomicity  constraints.  These 
constraints  are  fulfilled  as  every  layer  is  interference  free, 
because  all  actions  are  placed  in  the  same  layer  and  no 
parallel  interfering  processes  exist. 

After this transformation we can replace the layered construct
for f ∈ frag layer by for f ∈ frag dopar, as all conflicts are
resolved. The correctness of this transformation is guaranteed
by the independence law.

The  code  of  the  resulting  program  is  omitted. 

Now we introduce local minimum weight outgoing edges for every
node in every fragment, lmo(v). For these variables the
following invariant must hold.

I1 :  mo(f) = min-edge({lmo(v) | v ∈ f})

This, plus the previous changes, leads to the following
algorithm:

MST2 ≝
    Init •
    layer
        for f ∈ frag dopar
            for v ∈ f dopar
                lmo(v) := min-edge(out(f)(v))
            rof •
            mo(f) := min-edge({lmo(v) | v ∈ f}) •
            if mo(f) ≠ nil then
                SE(src(f)(mo(f)))(dest(f)(mo(f))) := branch
            fi
        rof •
        frag := concomp((V, B(SE)))
    until ∧{mo(f) = nil | f ∈ frag}

where

Init ≝
    for v ∈ V dopar
        for {u, v} ∈ inc(v) dopar
            SE(u)(v) := basic
        rof
    rof •
    frag := concomp((V, B(SE)))

The correctness of these transformation steps can again be
easily verified (see [JZa]).

In MST2 we still have the set of fragments frag as a variable.
This information must be localized too. We do so by introducing
boolean variables core(v) for every node v, and defining a
function Frag(u) giving the fragment u belongs to, i.e. the set
of nodes and edges connected to u by branches.

^i(v)  {«  G  V  I  connected(v, «)} 

Trag{v) 

mv),  {  {«,«'}  G  I  SE{u){u')  =  branch}) 

where connected means that v and u are connected via a path of
branches. We leave its definition implicit.

We  define  the  following  representation  function  and 
(data)  invariants: 

F(core) ≝ {Frag(u) | u ∈ V, core(u)}

MO(Frag(u)) ≝ mo(v) such that v ∈ Frag(u) ∧ core(v)

I2 :  ∀v ∈ V (∃!u ∈ Frag(v) (core(u)))
I3 :  ∀{u, v} ∈ E (SE(u)(v) = SE(v)(u))

Data  invariant  I3  is  now  needed  to  guarantee  the  cor¬ 
rectness  of  the  definition  of  connected.  Furthermore  we 
will  make  use  of  this  in  later  stages. 

We  define  the  set  Core  as 
Core ≝ {v ∈ V | core(v)}

How can we now achieve I2, i.e. how do we choose a unique core
node for every fragment? Initially this is no problem; every
fragment consists of a single node. After that we know that for
every new fragment there were exactly two subfragments that had
the same minimum weight outgoing edge, the core edge. As all
nodes are ordered we can take the first of the nodes adjacent to
the core. In order to determine which nodes are adjacent to
minimum weight outgoing edges we introduce variables
con-req(u)(v) stating that {u, v} was the minimum weight
outgoing edge of Frag(u). The following invariant will therefore
hold:

I4 :  ∀u ∈ Core (con-req(v)(v') iff
          {v, v'} = mo(u) ∧ v ∈ Frag(u))

Apart from some minor changes in MST2 we have to establish I2
at the end of the loop, i.e. the final statement in the
layer ... until will be ComputeCore, defined as

ComputeCore ≝
    for u ∈ Core dopar core(u) := false rof •
    for v ∈ V dopar
        for {v, x} ∈ inc(v) dopar
            if con-req(v)(x) then
                con-req(v)(x) := false •
                if SE(v)(x) = branch then
                    if v < x then
                        core(v) := true
                    fi
                else SE(v)(x) := branch
                fi
            fi
        rof
    rof

The correctness of this transformed MST2, which we will call
MST3, can again be proven by proof outline
transformation  and  Owicki-Gries  style  verification  with 
trivialized  interference  freedom  tests  because  of  local 
variables.  In  this  proof  lemma  3.2  and  theorem  3.1  are 
needed.  The  proof  itself  is  omitted. 
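
The rule for choosing core nodes, as described in the text, can be sketched independently of the flag handling in ComputeCore. The Python fragment below is a hedged illustration and does not reproduce ComputeCore: connection requests are abstracted to a set of ordered pairs (v, x), meaning that {v, x} was chosen as minimum weight outgoing edge by the fragment of v, and the name `choose_cores` is hypothetical.

```python
def choose_cores(requests):
    """Split requesting nodes into new core nodes and absorbed requesters.

    An edge requested from both sides is a core edge, and the endpoint
    with the least name becomes the core node. A request that is not
    reciprocated means the requesting fragment is absorbed.
    """
    cores, absorbed = set(), set()
    for (v, x) in requests:
        if (x, v) in requests:
            cores.add(min(v, x))
        else:
            absorbed.add(v)
    return cores, absorbed
```

With requests (1, 2), (2, 1) and (3, 2), node 1 becomes the core node of the combined fragment, and the fragment of node 3 is absorbed.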

3.4  Recursively computing the minimum weight outgoing edge

Now that we have introduced core nodes we can make use of the
fact that we can view a fragment as a rooted tree to compute
mo(u) for every core node u. To do so we introduce variables
up(v) denoting the edge toward the root of node v in the
fragment (for ¬core(v)). Furthermore we introduce a variable
mo(v) for every node v, not only core nodes, denoting the
minimum weight outgoing edge of the tree in the fragment of
which v is the root.

We also need to synchronize the nodes in a fragment: a node v
needs the values of all its successors in the tree to compute
mo(v). For this purpose we define synchronization flags
mo-comp(u) that are set to true when the value of mo(u) has been
computed. Synchronization is also needed to inform nodes that
have to change their root and upward edge, i.e. nodes that are
on the path from the core to the minimum weight outgoing edge,
as we do not want to implement this by core actions only. For
this purpose we introduce three-valued flags
change(u) ∈ {true, false, ⊥}. In the algorithm below these
synchronization variables are indexed by the number of the
layer. This is in order to guarantee synchronization w.r.t. that
layer, or, viewed differently, to make the layers communication
closed. We come back to that later.


The following notation is used:

down(v) ≝ {j = {v, u} ∈ inc(v) |
              SE(v)(u) = branch ∧ j ≠ up(v)},

tree(v) ≝ {({v}, down(v))} ∪
              ⋃{tree(u) | {v, u} ∈ down(v)},

and

root(v) ≝  v,                       if core(v)
           root(dest(v)(up(v))),    if ¬core(v)

We  furthermore  define  the  path  between  two  (con¬ 
nected)  nodes  u  and  v  as  the  sequence  of  nodes  on 
the  path.  The  full  definition  is  omitted.  For  a  path 
p  =  [uvw ...]  we  define  the  first  edge  of  p,  first(p),  as 
the  pair  {u,  v}. 

For the next algorithm MST4 the following invariants will hold:

I5 :  mo-comp(v) ⇒
          mo(v) = min-edge({mo(u) | u ∈ tree(v)})

I6 :  (mo-comp(v) ∧ down(v) = ∅) ⇒ mo(v) = lmo(v)

I7 :  ∀u, v ∈ V ((core(v) ∧ connected(u, v)) ⇒
          up(u) = first(path(u, v)))


For  sake  of  brevity  we  immediately  introduce  a  second 
addition  in  this  algorithm.  Let  be{v)  be  the  edge  leading 
to  the  minimum  weight  outgoing  edge,  i.e. 

/g  :  Vu,  V  G  V((u  ^  vA 

fmo(u)  =  mo(t;)  A  connected{v,  u))^ 

6e(t;)  =  first(jpath{v,  u)) ) 

/g  :  lmo{u)  =  mo{u)^be{u)  =  lmo{u) 

All this leads to the following algorithm:

MST4 =
    Init •
    layer l
        ComputeLocal(l) •
        ComputeGlobal(l) •
        ChangeRootPath(l) •
        ComputeCore(l)
until  A{’tm>(*')  =  I 
where 
Init =
    for v ∈ V dopar
        core(v) := true ∥ up(v) := nil ∥ be(v) := nil ∥
        for {v, x} ∈ inc(v) dopar
            SE(v)(x) := basic ∥ con_req(v)(x) := false
        rof
    rof ,

ComputeLocal(l) =
    for u ∈ Core dopar
        for v ∈ Frag(u) dopar
            lmo(v) := min_edge(out(Frag(u))(v)) ∥
            mo_comp(v)(l) := false
        rof
    rof ,

ComputeGlobal(l) =
    for u ∈ Core dopar
        for v ∈ Frag(u) dopar
            mo(v), be(v) := lmo(v), lmo(v)
            for {v, x} ∈ down(v) dopar
                await mo_comp(x)(l) do
                    if w(mo(x)) < w(mo(v)) then
                        mo(v), be(v) := mo(x), {v, x}
                    fi
                od
            rof ;
            mo_comp(v)(l) := true
        rof
    rof

Note that we need the sequential composition at this stage to
enforce the right moment of synchronization. Furthermore let

ChangeRootPath(l) =
    for u ∈ Core dopar
        change(u)(l) := (mo(u) ≠ nil) •
        for v ∈ Frag(u) dopar
            await change(v) ≠ ⊥ •
            if change(v)(l) then
                up(v) := be(v) •
                if SE(v)(dest(v)(be(v))) = branch then
                    change(dest(v)(be(v)))(l) := true
                else
                    SE(v)(dest(v)(be(v))) := branch •
                    con_req(v)(dest(v)(be(v))) := true
                fi
            fi •
            for i ∈ down(v) − {be(v)} dopar
                change(dest(v)(i))(l) := false
            rof
        rof
    rof

and ComputeCore(l) analogously to MST3.
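The interplay of ComputeGlobal and ChangeRootPath can be illustrated with a small sequential sketch. It is not the layered algorithm itself: the await on mo_comp(x) is replaced by plain recursion, weights stand in for edges, and the names compute_global, root_path, tree and lmo_weight are hypothetical. Here be[v] is None when the best outgoing edge is v's own local one.

```python
def compute_global(tree, lmo_weight, root):
    """Bottom-up pass over one fragment tree.

    tree maps a node to its children (the role of down(v)); lmo_weight[v]
    is the weight of v's local minimum outgoing edge. Returns bw (best
    weight seen in the subtree of v) and be (child leading to that edge).
    """
    bw, be = {}, {}

    def visit(v):
        bw[v], be[v] = lmo_weight[v], None
        for x in tree.get(v, []):
            visit(x)                  # plays the role of "await mo_comp(x)"
            if bw[x] < bw[v]:
                bw[v], be[v] = bw[x], x

    visit(root)
    return bw, be


def root_path(be, root):
    """Nodes whose up pointer would change: the path from the core
    to the node owning the minimum weight outgoing edge."""
    path = [root]
    while be[path[-1]] is not None:
        path.append(be[path[-1]])
    return path


tree = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
lmo_weight = {"a": 7, "b": 9, "c": 5, "d": 3}
bw, be = compute_global(tree, lmo_weight, "a")
print(bw["a"], root_path(be, "a"))
```

Only the nodes on the returned path are touched, which mirrors the remark that up(v) is changed only between the core and the new minimum weight outgoing edge.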

The correctness of the above solution is quite involved, as it
includes proving deadlock freedom and correctness of the recursive
definition.¹ The proof, however, can be restricted to a single
layer within a single execution of the loop, which simplifies
matters to a large extent. It can be proven correct using Hoare
logic [Lam] or temporal logic [MP].

¹The solution given above deviates from [GHS] in the fact that
we only change up(v) on the path from the core to the new minimum
weight outgoing edge. In [GHS] every up(v) variable is reset in
every iteration when an INITIATE is received.

3.5 Introducing message passing

In MST4 we had to introduce variables to synchronize actions and
we had to copy computed values. As we are striving for a
distributed solution, it is quite possible to introduce
communication over channels to enforce synchronization and to pass
values. As send and receive actions are defined as guarded
assignments, this transformation is straightforward and simplifies
matters to a large extent.

Furthermore we want to remove all shared accesses from the
algorithm, as these are not possible in distributed
implementations. We therefore have to adapt the computation of
lmo and remove the shared accesses to con_req and mo(x) as well
as the references to Frag(u). This is done by introducing message
passing and variables fn(v) denoting the fragment name of v.

Some further simplifications are possible: we hardly ever refer
to the edge mo(v), but often to its weight. We therefore use
variables bw instead of mo. Also the variables lmo are obsolete,
as their function can be taken over by bw. Finally we can join the
parallel executions over all core nodes and all nodes in the
corresponding fragment into a parallel execution over every node
in V.

The result of these transformations, MST5, has the following
structure. The full code is omitted because of space limitations.

MST5 =
    for v ∈ V dopar Init(v)' rof •
    layer l
        for v ∈ V dopar ComputeLocal(v)(l)' rof •
        for v ∈ V dopar ComputeGlobal(v)(l)' rof •
        for v ∈ V dopar ChangeRootPath(v)(l)' rof •
        for v ∈ V dopar ComputeCore(v)(l)' rof •
        for v ∈ V dopar ChangeName(v)(l)' rof
    until ⋀{bw(v) = ∞ | v ∈ V}

The processes ComputeLocal' and ComputeCore' can both be split
into two processes by means of algebraic transformations and
proof outline transformations. The former can be split into a
kernel process concerned with computing be(v) and a test handler
TH(v) reacting to Test messages sent by other processes, the
latter into a process possibly trying to connect and a connect
handler CH(v) responding to Connect messages of other nodes.

In this process we use the rule that if there is only a single
send and a single receive action on a channel, executing them in
parallel is the same as first sending and then receiving (see
[JZ]).

We define basic(v) ≝ {{v, x} ∈ inc(v) | SE(v)(x) = basic}. The
processes TH(v) and CH(v) have the following form:

TH(v) =
    for {v, x} ∈ basic(v) dopar
        receive(v)(x)(TEST(fn(x)(v))) •
        if fn(x)(v) = fn(v) then send(v)(x)(REJECT)
        else send(v)(x)(ACCEPT)
        fi
    rof

CH(v) =
    for {v, x} ∈ basic(v) dopar
        send(v)(x)(NOCONNECT) ∥
        (receive(v)(x)(NOCONNECT) or
         (receive(v)(x)(CONNECT) • SE(v)(x) := branch))
    rof

3.6 Applying Communication Closed Layers

What we eventually want to arrive at is a distributed
implementation, i.e. an implementation of the form
MSTf =
    for v ∈ V dopar
        Init(v) •
        layer P(v) until B(v)
    rof

The algorithm MST5, however, is still of a sequential nature. We
want to apply the Communication Closed Layers Law to transform
MST5 to a distributed structure. To be able to apply the CCL law
or its iterated version, MST5 must be of the correct structure and
fulfill the premises.

In order to arrive at the desired structure we first have to
transform the loop body. This consists of a given number of
layers that are all of the form
    L_i = for v ∈ V dopar L_i(v) rof
This is the correct structure to apply CCL. Furthermore all
layers are communication closed, as communication takes place
within a layer and other conflicts only exist within the process
of a single node. We can now transform the loop body L as
follows:

L
= { by definition }
    (ComputeLocal' ∥ TH) •
    ComputeGlobal' •
    ChangeRootPath' •
    (ComputeCore' ∥ CH) •
    ChangeName'
= { ∥ is commutative and associative }
    for v ∈ V dopar ComputeLocal(v)' ∥ TH(v) rof •
    for v ∈ V dopar ComputeGlobal(v)' rof •
    for v ∈ V dopar ChangeRootPath(v)' rof •
    for v ∈ V dopar ComputeCore(v)' ∥ CH(v) rof •
    for v ∈ V dopar ChangeName(v)' rof
= { CCL }
    for v ∈ V dopar
        (ComputeLocal(v)' ∥ TH(v)) •
        ComputeGlobal(v)' • ChangeRootPath(v)' •
        (ComputeCore(v)' ∥ CH(v)) •
        ChangeName(v)'
    rof
= { by definition P(v) }
    for v ∈ V dopar P(v) rof
= L'

We have now transformed L to a form suitable for the application
of CCL. The guard of the loop, however, does not satisfy the
premise of the iterated CCL rule:

\[ \{B\} \; P \; \{B \vee (\forall v \in V\, (\neg B(v)))\} \]

The last layer of the loop, however, consists of a broadcast of
the name of the fragment. We change ChangeName in such a way that
the new fragment name is term. This allows us to restate the
termination condition as:

\[ B' = \bigwedge \{\, fn(v) = term \mid v \in V \,\} \]

This condition does satisfy the premise, as

\[ \exists v \in V\, (fn(v) = term) \Rightarrow \forall v \in V\, (bw(v) = \infty) \]

We can now transform MST5 as follows:

MST5
= { by definition }
    Init' • layer l : L until B'
= { L = L' }
    Init' • layer l : L' until B'
= { definition L' }
    Init' •
    layer l
        for v ∈ V dopar L(v)' rof
    until B'
= { iterated CCL }
    Init' •
    for v ∈ V dopar
        layer l : P(v) until fn(v) = term
    rof
= { CCL }
    for v ∈ V dopar
        Init(v)' • layer l : P(v) until fn(v) = term
    rof
= { by definition }
    for v ∈ V dopar N(v) rof
= MST6

The result of these transformations, MST6, has the desired
distributed structure.

3.7 Multiplexing channels and optimizing messages

In MST6 we used different channels for different layers. This
assumption is not realistic, but it is possible to multiplex a
number of channels on a single (buffered) channel. The buffer
length must be at least the number of layers that are executed,
which is limited by log(|V|). A channel c is in this case not
given by a pair (flag, val), but by a function
c : [0..l] → (flag, val). We can now assume that the level number
is tagged to every message, and we can implement a send action
send(v)(u)(MSG(e, ln)) as
send(v)(u)(MSG(e, ln)) =
    (¬c(ln).flag) & c(ln).flag, c(ln).val := true, e
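A minimal sketch of such a multiplexed channel follows, assuming one (flag, val) slot per layer number. The class name LayeredChannel and the non-blocking try_send and try_receive methods are hypothetical; in particular the receive side is not spelled out in the text and is only an assumed counterpart of the guarded send.

```python
class LayeredChannel:
    """One buffered channel carrying a (flag, val) slot per layer."""

    def __init__(self, layers):
        self.flag = [False] * layers
        self.val = [None] * layers

    def try_send(self, e, ln):
        # Guard of the send action: the slot of layer ln must be empty.
        if self.flag[ln]:
            return False
        self.flag[ln], self.val[ln] = True, e
        return True

    def try_receive(self, ln):
        # Assumed counterpart: consume the value and clear the flag.
        if not self.flag[ln]:
            return (False, None)
        e = self.val[ln]
        self.flag[ln], self.val[ln] = False, None
        return (True, e)


ch = LayeredChannel(3)
print(ch.try_send("TEST", 1), ch.try_send("REPORT", 1), ch.try_send("REPORT", 2))
print(ch.try_receive(1), ch.try_receive(1))
```

Messages of different layers do not block each other, while a second message for the same layer has to wait until the slot has been emptied.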

By now we have transformed the algorithm into MST6, where N(v)
has a structure as in fig. 1, in which some subroutines have been
left implicit. In this algorithm a number of optimizations are
possible. First of all we can group statements for core and
non-core nodes by using proof outlines and transforming them. By
introducing the right invariants we can furthermore show that
some messages (e.g. NoConnect and NoChange) are obsolete and can
be removed.

There are also some other optimizations possible w.r.t. the
number of messages: if we received a Reject message on some basic
edge, we will always receive a Reject message on that edge.
Introducing a reject status for edges saves repeated status
requests. We can also check basic edges one by one, instead of
all in parallel. In that case we have to check them in order of
weight. This optimization allows us to postpone the absorption of
fragments: if the edge along which the fragment sent a Connect
has too large a weight, Lemma 3.3 guarantees that we can absorb
it at a later stage.
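The effect of a reject status combined with testing edges in order of weight can be sketched as follows. The function and its arguments are hypothetical, and the test exchange is simulated by a lookup in fragment_of instead of real Test, Accept and Reject messages.

```python
def find_min_outgoing(own_fragment, edges, fragment_of, status):
    """Test basic edges one by one in order of weight.

    edges is a list of (weight, neighbour); status maps a neighbour to
    'basic' or 'rejected' and is updated in place. Returns the lightest
    accepted edge (or None) and the number of tests sent.
    """
    tests_sent = 0
    for weight, x in sorted(edges):
        if status[x] != "basic":
            continue                      # rejected before: never test again
        tests_sent += 1
        if fragment_of[x] == own_fragment:
            status[x] = "rejected"        # remember the Reject
        else:
            return (weight, x), tests_sent
    return None, tests_sent


edges = [(4, "b"), (2, "c"), (9, "d")]
fragment_of = {"b": "F", "c": "F", "d": "G"}
status = {"b": "basic", "c": "basic", "d": "basic"}
first = find_min_outgoing("F", edges, fragment_of, status)
second = find_min_outgoing("F", edges, fragment_of, status)
print(first, second)
```

In the second call only one test is sent, since the two internal edges already carry the reject status.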

The details of these optimizations can be found in the full
paper ([JZa]).

The final implementation step is now to replace layer composition
by sequential composition, and to replace parallel composition by
sequential iteration within a component. This does not invalidate
the correctness (for non-interfering parallelism) and results in
an implementable, distributed algorithm.

4  Conclusion 

The layering techniques used to derive the implementation of a
distributed minimum weight spanning tree algorithm have proven to
be a powerful means in the development of parallel systems. This
also holds for a posteriori verification, where they can give
insight into the structure of the implementation and the
intuitive ideas of the designers.

These techniques are applicable to a large number of problems,
not only to this type of algorithm. Other examples, varying from
parsing algorithms to database protocols, can be found in [JZ],
[JPZ], and [PZ].

At this moment we are investigating the relation between the
process based approach as used in this paper and logic based
approaches to layering like [KP]. The use of non-static
dependency relations might be useful in our context too.

Algorithms relying on real-time synchronization, such as atomic
broadcast protocols [CASD], are also studied in our framework.

N(v) ≝
    up(v) := nil ∥ core(v) := true ∥ fn(v) := v ∥
    for {v, x} ∈ inc(v) dopar
        SE(v)(x) := basic • con_req(v)(x) := false
    rof •
    layer l
        be(v), bw(v) := nil, ∞
        ( for {v, x} ∈ basic(v) dopar
              send(v)(x)(TEST(fn(v), l)) •
              (receive(v)(x)(REJECT(l)) or
               (receive(v)(x)(ACCEPT(l)) •
                ( if w({v, x}) < bw(v) then
                      bw(v), be(v) := w({v, x}), {v, x}
                  fi )))
          rof ∥ TH(v)) •
        for {v, x} ∈ down(v) dopar
            (receive(v)(x)(REPORT(b(v), l)) •
             if b(v) < bw(v) then
                 be(v), bw(v) := {v, x}, b(v)
             fi)
        rof •
        if ¬core(v) then
            send(v)(dest(v)(up(v)))(REPORT(bw(v), l))
        fi •
        if core(v) then ChangeRootPath
        else
            (receive(v)(dest(v)(up(v)))(CHANGEROOT(l)) •
             up(v) := be(v) • ChangeRootPath) or
            (receive(v)(dest(v)(up(v)))(NOCHANGE(l)) •
             NoChangeRoot)
        fi •
        core(v) := false •
        ( if con_req(v)(dest(v)(be(v))) then
              send(v)(dest(v)(be(v)))(CONNECT(l)) •
              con_req := false •
              (receive(v)(dest(v)(be(v)))(NOCONNECT(l)) or
               (receive(v)(dest(v)(be(v)))(CONNECT(l)) •
                core(v) := v < dest(v)(be(v))))
          fi ∥ CH(v)) •
        if core(v) then
            if bw(v) = ∞ then fn(v) := term
            else fn(v) := v fi
        else
            receive(v)(dest(v)(up(v)))(INITIATE(fn(v), l))
        fi •
        BroadCastName(v)
    until fn(v) = term

Figure 1:

Abstract

Floyd’s method based on well-orderings is the standard approach
to proving termination of programs. Much attention has been
devoted to generalizing this method to termination of programs
that are subjected to fairness constraints. Earlier methods for
fair termination tend to be somewhat indirect, relying on program
transformations, which reduce the original problem to several
termination problems.

In this paper we introduce the new concept of stack assertions,
which directly, without transformations, quantify progress
towards fair termination. Moreover, we show that by one simple
program transformation of adding a history variable, usual
assertional logic, without fixed-point operators, is sufficiently
expressive to form a sound and relatively complete method when
used with stack assertions. This result is obtained as part of a
substantial simplification of earlier completeness proofs.

1  Introduction 

Fairness is the assumption that an action that is enabled over
and over will eventually be taken. Such assumptions are central
to many distributed or concurrent systems. The fair termination
problem (how to prove that a program terminates under assumption
of fairness) is typical of much reasoning with fairness, and many
methods for this problem have been suggested; see [AO83, AFK88,
DH86, FK84, Fra86, GFMdRv85, LPS81, MP91, SdRG89]. Most of the
methods build on Floyd’s approach of using well-ordered sets as a
measure of how close the program is to termination. Floyd’s ideas
allow one to annotate the unaltered program with assertions
expressing closeness to termination, whereas many of the earlier
methods for fair termination depend on changing the program. The
modifications either consist of adding new program variables and
unbounded nondeterminism, or involve recursively applied proof
rules that transform the program. Unfortunately, these
modifications tend to obscure how each step of the original
program contributes to fair termination.

The goal of this paper is a lucid and practical approach for
showing fair termination without repeated or drastic
transformations. Our approach is based on the novel concept of
progress measure introduced in [Kla90]; see also [Kla, KK91,
Kla91, KS93]. A progress measure is a function on the states (or
histories) of the unaltered program. The value of the function
for a given state quantifies, in a certain mathematical sense,
how close that state is to satisfying a property about infinite
computations. The property is defined by a specification, which
characterizes states (or histories), and by a limit condition,
which when applied to the specification defines the allowed
infinite computations.

The property, for example, could be that every infinite
computation is unfair; this means that the program fairly
terminates. In this case, the specification characterizes for
each state which actions are enabled and which are taken; the
limit condition expresses the fixed temporal meaning of
unfairness: some action is enabled infinitely often while being
taken only finitely often.

An essential property of a progress measure is that on every
program transition, its value changes in a way ensuring that the
computation converges according to the limit condition. This
requirement can be formulated as verification conditions, which
allow verification of global properties (on infinite
computations) in terms of local reasoning (about states and
transitions). This correspondence is especially meaningful if the
local reasoning can be done for an unmodified program, in which
case it becomes clear what the exact contribution of each program
step (by each process) is towards the global behavior. In this
paper we provide a practical tool, called a stack assertion, that
provides such an understanding for distributed or concurrent
programs that fairly terminate. The stack assertions of a program
define a mapping, called a fair termination measure, that
describes how close each program state is to fair termination.

The contributions of this paper are both practical and
theoretical. We demonstrate the usefulness of stack assertions by
examples. For distributed or concurrent programs, our examples
indicate a direct way of attributing “lack of progress towards
termination” to “progress towards unfair execution” as expressed
by a hierarchy of unfairness hypotheses. Stack assertions form
the natural framework for expressing this hierarchy and summarize
in a single data structure the information obtained by the
program transformations of previous methods. Since the need for
transformations has been eliminated, stack assertions can be
added to existing assertional methods for concurrent and
distributed programs.

There are two theoretical results of this paper. The first is a
new completeness proof, substantially simpler than earlier proofs
that involve transfinite induction or results from topology, that
explains why a fair termination measure always exists for
programs or distributed systems that fairly terminate.

The second result is that by adding a history variable to a
program, the fair termination measure can be expressed by means
of stack assertions in any reasonably expressive assertion
language (i.e. a language that includes arithmetic).

In some earlier work on expressing assertions about fair
termination [SdRG89, Mor90, MP91], predicate calculus is combined
with fixed-points and ordinals. For an arbitrary program, this
calculus allows one to characterize precisely the states from
which all infinite computations are unfair.

In the present paper we show that with the addition of a history
variable, an assertion language containing only predicate
calculus is sufficient for a proof of fair termination from the
initial state. In our method, well-foundedness is expressed not
by fixed-point logic in program assertions, but as an additional
requirement that a relation, expressed by the program assertions,
is well-founded (has no infinite descending chains). Observe that
adding history information (as for example is also done in
methods for verification with nondeterministic automata [AL91,
Sis91]) is a benign transformation in that it has to be done only
once and basically does not change the transitional structure of
the program; in particular, no additional nondeterminism is
added.


2  Verification  Methods  for  Fairness 

A fairness constraint partitions infinite computations into fair
and unfair ones. In this paper we shall concentrate on strong
fairness, which is one of the most important fairness concepts.
According to this criterion, a computation is fair if commands
(or processes, statements, actions, events, ...) that are enabled
infinitely often are also executed infinitely often. (It is
assumed that the number of different commands is finite.) Thus an
unfair computation is one where some command is enabled
infinitely often but only executed finitely often. A program P
fairly terminates if every infinite computation of P is unfair. A
verification method for fair termination is defined in terms of
verification conditions expressed in the style of Hoare’s logic.
To be useful a method must be sound, i.e. any program for which
the verification conditions can be satisfied must fairly
terminate. The method is complete if the verification conditions
can be satisfied for any program that fairly terminates.
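For an ultimately periodic computation the definition of strong fairness can be checked directly, because a command is enabled (or executed) infinitely often exactly when it is enabled (or executed) somewhere in the repeated part. The following sketch assumes this restricted setting; the function name and the encoding of the cycle as pairs (enabled commands, executed command) are hypothetical.

```python
def is_strongly_fair(cycle):
    """Strong fairness of the infinite computation that repeats `cycle`.

    cycle is a list of (enabled, executed) pairs, one per step of the
    repeated part; enabled is a set of command names.
    """
    enabled_infinitely = set()
    executed_infinitely = set()
    for enabled, executed in cycle:
        enabled_infinitely |= set(enabled)
        executed_infinitely.add(executed)
    # Fair iff every command enabled infinitely often is also
    # executed infinitely often.
    return enabled_infinitely <= executed_infinitely


always_b = [({"la", "lb"}, "lb")]
alternate = [({"la", "lb"}, "la"), ({"la", "lb"}, "lb")]
print(is_strongly_fair(always_b), is_strongly_fair(alternate))
```

A program fairly terminates when no infinite computation passes this check; the sketch only classifies a single given computation and is not a verification method.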

Complete verification methods for strongly fair termination are
considered in [GFMdRv85, LPS81, SdRG89]. These methods are based
on helpful directions, which indicate program statements that are
being unfairly executed. The approach of helpful directions has
been successful at explaining many fairness concepts, such as
those involving general state predicates [FK84] or an infinite
number of commands [Mai89]. All these methods involve the
recursive use of proof rules that are applied to transformed
programs. Thus they tend to depend on particular syntactic
properties of the underlying program language [Fra86, page 117]
(although a way of circumventing these syntactic dependencies is
indicated in [Fra86, section 2.4]).

The methods of explicit schedulers developed in [AO83, APS84,
DH86] involve transforming programs by adding auxiliary variables
that are nondeterministically assigned values determining fair
computations. Because they involve rather drastic (even “cruel”
[DH86]) program transformations, these methods also deal with
fairness in a somewhat indirect manner. For an extensive
treatment of fairness based on helpful directions and explicit
schedulers, see the book [Fra86].

In [SdRG89] it was shown how predicate calculus augmented with
fixed-points can be used to express assertions about fair
termination. This calculus can express inductively definable
relations [Mos74], which are needed in their proof.¹ Usually,
however, assertional reasoning is based on ordinary predicate
calculus, which corresponds to arithmetic relations [Rog67].
Earlier, Apt and Plotkin, motivated by the relationship mentioned
above between fairness and nondeterminism, gave a semantic model
for countable nondeterminism [AP86]. In addition, they provided a
relatively complete proof system for termination, also based on
fixed-point logic.

Using a fragment of fixed-point calculus, Manna and Pnueli
formulated elegant proof rules for assertional reasoning about
properties expressed in temporal logic. For the problem of fair
response (which generalizes fair termination), they exhibited a
simple proof rule, which is recursively applied to transformed
programs.

Morris [Mor90] also used fixed-point calculus in his formulation
of a weakest precondition semantics for fair termination of
tail-recursive programs.

The work presented here is also related to the theory of automata
on infinite words. In fact, the condition of fair termination is
but an instance of a Rabin pairs condition, see [KK91], which is
a requirement in a special disjunctive normal form about the
infinite occurrence of states. The proofs in the present paper
could have been formulated for Rabin pairs conditions (thus
yielding a method for general fairness [FK84]), but for
simplicity of exposition we have used conditions pertaining to
strong fairness.

The Rabin progress measures in [KK91] express progress towards
satisfaction of a Rabin pairs condition. Applied to fair
termination, a Rabin progress measure maps program states into a
special kind of colored trees. This gives a concise method for
fair termination that does not depend on program transformations
[KK91]. The method is not entirely practical, however, because
there is no natural way to describe the mapping into the colored
tree, which has to be described explicitly; these obstacles are
overcome in this paper. For a more detailed comparison, see
Section 5.

A concept similar to our stacks is used in [Saf92], where the
problem is to determinize an automaton with a Streett condition
(a special conjunction) or, equivalently, to express that a Rabin
condition holds along all computations of a nondeterministic
automaton by means of a deterministic automaton. For
complementation of tree automata, the last appearance record of
[GH82] serves a purpose different from that of our stacks, namely
to keep enough information about the past to make finitely
represented choices in a winning strategy for a game that is won
by satisfying a conjunction (such as a Streett condition). On the
other hand, stacks in the form of Rabin progress measures can be
used to show that there is no need for such information when the
game is won by a disjunction (Rabin condition) [Kla92].

¹The inductively definable relations are the same as the
$\Pi^1_1$ relations. $\Pi^1_1$ is the class of relations of the
form $\forall \alpha : \rho$, where $\alpha$ is a second-order
object (such as an infinite computation) and $\rho$ is a
first-order formula (such as the one expressing that a
computation is unfair). The problem of fair termination is
$\Pi^1_1$-complete, as is the problem of termination (of programs
with countable nondeterminism).

Finally, our work is related to more general techniques for
proving liveness properties. Harel showed that by transformations
on trees representing programs one can do program verification
for all finite levels of the Borel hierarchy [Har86]. Using an
automata-theoretic approach, Vardi gave a verification method for
very general properties, including the Borel hierarchy [Var87].
In Vardi’s framework, progress is measured relative to a
nondeterministic automaton that defines incorrect computations.
In contrast, the progress measures of [Kla90, Kla] are functions
that relate the program state or history to a finite computation
of a correctness specification. With this approach nondeterminism
must be eliminated, since it makes it difficult to relate program
to specification by means of a function (cf. the work [AL91,
KS93, Sis91] on relating automata defining safety properties).
Instead, more powerful limit conditions than those usually
studied (e.g. Rabin or Streett conditions) are used to define the
infinite computations as limits of finite ones.

3  Stack  Assertions 

In this section we review the method of Floyd and explain how
assertions can define a measure of progress for termination. We
argue informally how stack assertions can be used to guarantee
fair termination. For simplicity we present our examples using
the language of guarded commands, but our technique is
syntax-independent and also applies to strong fairness expressed
for other formalisms describing distributed or concurrent
systems.

3.1  Floyd’s  Method 

For programs occurring in practice it is usually straightforward
to quantify progress towards termination. This is done in terms
of well-founded sets as first advocated by Floyd [Flo67]. A
well-founded set $(W, \succ)$ is a set $W$ with a binary relation
$\succ$ such that there is no infinite descending sequence
$w_0 \succ w_1 \succ \cdots$. For an example of proving
termination, take the program

\[ P1: \; *[\; x < y \rightarrow x := x + 1 \;] \]

consisting of a loop with a single guarded command, which is
executed as long as its guard, $x < y$, is enabled (true). The
variables take on integer values. To argue that P1 terminates, we
use the mapping $\mu^T = \max\{y - x, 0\}$ from program states
into the well-founded set of natural numbers
$0 < 1 < 2 < \cdots$. Here and in the sequel, the letter “T”
refers to the hypothesis that the program terminates. The mapping
$\mu^T$ can be called a termination measure, since its value
decreases with each iteration. The existence of a termination
measure guarantees that P1 terminates, because an infinite
computation $p_0, p_1, \ldots$ would produce an infinite
descending sequence $\mu^T(p_0) > \mu^T(p_1) > \cdots$,
contradicting the well-foundedness of the natural numbers.
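As a small illustration, the loop P1 can be executed while recording the value of the termination measure before every iteration. The function name run_p1 is hypothetical; the sketch merely observes that the recorded values form a strictly decreasing sequence of natural numbers.

```python
def run_p1(x, y):
    """Execute P1 (*[ x < y -> x := x + 1 ]) and record max(y - x, 0)
    at the start and after every iteration."""
    def measure(x, y):
        return max(y - x, 0)

    trace = [measure(x, y)]
    while x < y:                 # the guard of the single command
        x += 1
        trace.append(measure(x, y))
    return trace


trace = run_p1(2, 5)
print(trace)
# Every iteration strictly decreases the measure.
print(all(a > b for a, b in zip(trace, trace[1:])))
```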

In practice, the termination measure is expressed by annotating
the program with assertions. For P1 a single assertion suffices:

\[ P1': \; *[\; \lceil T : \max\{y - x, 0\} \rfloor \;\; x < y \rightarrow x := x + 1 \;] \]

Here $\lceil T : \max\{y - x, 0\} \rfloor$ is a simple stack
assertion. It asserts that for the termination hypothesis, also
called the T-hypothesis, the value of the termination measure is
$\max\{y - x, 0\}$ whenever the loop is to be executed. Thus it
could be called a loop variant² as opposed to a loop invariant.
The latter expresses a relationship between variables that is
preserved under iterations.

3.2  Fair  Termination 

With a trivial modification of P1, proving termination is
suddenly more intricate. Consider the program

\[ P2: \; *[\; \ell_a : x < y \rightarrow x := x + 1 \;\; \Box \;\; \ell_b : x < y \rightarrow skip \;] \]

where the loop is executed as long as $x < y$ by execution of
either of the guarded commands $x < y \rightarrow x := x + 1$ and
$x < y \rightarrow skip$. The choice is made nondeterministically.
This program will not terminate if the second command is always
chosen from some point on. Under assumption of (strong) fairness,
however, P2 always terminates, because in an infinite computation
of P2, $\ell_a$ is only executed finitely often, but enabled
infinitely often; thus the computation is unfair with respect to
command $\ell_a$.


¹The terms variant function or bound function are also used [Gri81].


The preceding argument was formulated in terms of infinite
computations. In contrast, assertional reasoning deals only with
program states and single transitions. The key to assertional
reasoning about fairness is:

If there is no progress towards termination,
this can be attributed to some statement being
executed unfairly.

For example, when $\ell_b$ is executed, the T-hypothesis is not active
since there is no progress towards termination. Instead, progress
towards executing $\ell_a$ unfairly can be measured. To do this we
reformulate the assertion by including the unfairness hypothesis,
called the $\ell_a$-hypothesis, that $\ell_a$ is executed unfairly.
Syntactically, this is done by putting "$\ell_a$" on top of the
underlying T-hypothesis; thus we write


$$
\boxed{\begin{array}{l} \ell_a \\ T : \max\{y - x, 0\} \end{array}}
$$

This stack assertion expresses a hierarchy in which the T-hypothesis
is the underlying hypothesis and the role of the $\ell_a$-hypothesis
is to explain progress when the underlying hypothesis cannot. The
annotated program is now:

$$
\begin{array}{lll}
P2': & *[ & \boxed{\begin{array}{l} \ell_a \\ T : \max\{y - x, 0\} \end{array}} \\
     &    & \ell_a:\ x < y \rightarrow x := x + 1 \quad \square \\
     &    & \ell_b:\ x < y \rightarrow \mathbf{skip}\ ]
\end{array}
$$

Progress is made towards unfair execution in terms of the
$\ell_a$-hypothesis whenever $\ell_a$ is enabled but not executed.
Note that for any iteration, either

$(V_{\ell_a})$ $\ell_a$ is enabled and not executed, and the
underlying T-measure remains constant (when $\ell_b$ is
executed); or

$(V_T)$ measure $\mu^T$ decreases (when $\ell_a$ is executed).

We can now argue that P2 fairly terminates in terms of the local
conditions $(V_{\ell_a})$ and $(V_T)$ as follows. In an infinite
computation, either from some point on $(V_{\ell_a})$ always applies,
or infinitely often $(V_T)$ applies. In the first case, $\ell_a$ is
always enabled but never executed. Hence the computation is unfair
with respect to $\ell_a$. In the second case, each time $(V_T)$
applies, $\mu^T$ is decreased, and at the other times, when
$(V_{\ell_a})$ applies, $\mu^T$ is unchanged. This yields an infinite
decreasing sequence of natural numbers, which is a contradiction.

Thus we have proved that for any infinite computation of P2, only the
first case is possible, i.e. P2 fairly terminates. This argument will
later be generalized to a soundness result, which shows that a
verification condition, similar to the local conditions above, always
implies fair termination. For now, however, we motivate this general
result by looking at more examples.
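
The two local conditions for P2 can be checked mechanically on individual transitions. The sketch below is an illustration under stated assumptions: the helper names (`step_p2`, `local_condition`, `simulate`) and the use of a pseudo-random scheduler are hypothetical choices, not part of the method. Each transition is classified as satisfying the T-condition (the measure decreases) or the $\ell_a$-condition ($\ell_a$ enabled but not executed, measure unchanged).

```python
import random


def step_p2(x, y, choice):
    """Execute one guarded command of P2; both guards are x < y."""
    assert x < y
    return (x + 1, y) if choice == "a" else (x, y)


def local_condition(x, y, choice):
    """Classify one transition of P2.

    Returns "V_T" if the T-measure max{y - x, 0} decreases, "V_la" if
    l_a is enabled but not executed and the measure is unchanged, and
    None if neither condition holds.
    """
    before = max(y - x, 0)
    x2, y2 = step_p2(x, y, choice)
    after = max(y2 - x2, 0)
    if after < before:
        return "V_T"
    if choice != "a" and x < y and after == before:
        return "V_la"
    return None


def simulate(x, y, seed=0, max_steps=10_000):
    """Run P2 under a pseudo-random scheduler and classify every step."""
    rng = random.Random(seed)
    labels = []
    while x < y and len(labels) < max_steps:
        choice = rng.choice("ab")
        labels.append(local_condition(x, y, choice))
        x, y = step_p2(x, y, choice)
    return labels


labels = simulate(0, 5, seed=1)
assert None not in labels          # every transition satisfies a condition
```

A scheduler that always picks `"b"` would produce only `"V_la"` labels, which is exactly the unfair computation described above.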

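3.3 Progress Measures for Unfairness Hypotheses

A more complex situation arises if $\ell_a$ is sometimes not enabled
when $\ell_b$ is executed. In this case the stack assertion
$\boxed{\begin{array}{l} \ell_a \\ T : \max\{y - x, 0\} \end{array}}$
cannot be applied, because condition $(V_{\ell_a})$ is then sometimes
not fulfilled. If the program fairly terminates, however, we can use
a progress measure for the $\ell_a$-hypothesis. For an example of this
situation, take:

$$
\begin{array}{lll}
P3: & *[ & \ell_a:\ x < y \wedge z \bmod 117 = 0 \rightarrow x := x + 1 \quad \square \\
    &    & \ell_b:\ x < y \rightarrow z := z - 1\ ]
\end{array}
$$

This program fairly terminates, because for any infinite computation,
$\ell_a$ can only be executed finitely often and the value of z
decreases by one each time $\ell_b$ is executed; thus $\ell_a$ is
enabled infinitely often. We might annotate the program as follows:

$$
\begin{array}{lll}
P3': & *[ & \boxed{\begin{array}{l} \ell_a : z \bmod 117 \\ T : \max\{y - x, 0\} \end{array}} \\
     &    & \ell_a:\ x < y \wedge z \bmod 117 = 0 \rightarrow x := x + 1 \quad \square \\
     &    & \ell_b:\ x < y \rightarrow z := z - 1\ ]
\end{array}
$$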
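
To see why $z \bmod 117$ is a natural progress measure for the $\ell_a$-hypothesis, the following Python sketch counts how many executions of $\ell_b$ are needed before the conjunct $z \bmod 117 = 0$ of the guard of $\ell_a$ holds. The function name `steps_until_la_enabled` is hypothetical, and the sketch relies on Python's `%` returning a non-negative remainder for a positive modulus.

```python
def steps_until_la_enabled(z, modulus=117):
    """Count executions of l_b (z := z - 1) until z mod modulus = 0."""
    steps = 0
    while z % modulus != 0:
        z -= 1          # effect of l_b
        steps += 1
    return steps


# The count coincides with the proposed measure z mod 117.
for z in (0, 1, 116, 117, 250, -5):
    assert steps_until_la_enabled(z) == z % 117
```

The measure therefore states how far the program is from a state in which $\ell_a$ is enabled, provided $x < y$ continues to hold.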


Here $\ell_a : z \bmod 117$ denotes that a progress measure
$\mu^{\ell_a} = z \bmod 117$ is associated with the
$\ell_a$-hypothesis. The measure $\mu^{\ell_a}$ is a measure of how
close P3 is to a state in which $\ell_a$ is enabled. For each
iteration of the loop, either

$(V_{\ell_a})$ measure $\mu^T$ is unchanged, $\ell_a$ is not executed,
and either the value of z was 0 (mod 117) before the
execution of $\ell_b$, in which case $\ell_a$ was enabled, or
the value of z (mod 117) was between 1 and 116 and
decreases by 1; or

$(V_T)$ measure $\mu^T$ decreases.

The local conditions $(V_{\ell_a})$ and $(V_T)$ ensure that an
infinite computation is unfair with respect to $\ell_a$. Consider the
corresponding infinite sequence of stacks. It must be the case that
from some point on, $(V_{\ell_a})$ applies to each transition. Thus
$\ell_a$ is only executed finitely often. If from some point on it is
never enabled, then $\mu^{\ell_a}$ decreases for each iteration
thereafter, contradicting the well-foundedness of the natural
numbers. Therefore, $\ell_a$ is enabled infinitely often, and we
conclude that any infinite computation is unfair with respect to
$\ell_a$.

3.4 Unfairness of Several Commands

An even more challenging situation occurs when more than one command
may be executed unfairly. If we add an empty guarded command to P3,
we obtain:

$$
\begin{array}{lll}
P4: & *[ & \ell_a:\ x < y \wedge z \bmod 117 = 0 \rightarrow x := x + 1 \quad \square \\
    &    & \ell_b:\ x < y \rightarrow z := z - 1 \quad \square \\
    &    & \ell_c:\ x < y \rightarrow \mathbf{skip}\ ]
\end{array}
$$

This program fairly terminates, because any infinite computation is
unfair with respect to either $\ell_a$ or $\ell_b$. To see this we use
the loop variant from P3' modified to explain the lack of progress
when $\ell_c$ is chosen for execution. In that case there is progress
neither towards termination nor towards executing $\ell_a$ unfairly.
But there is always progress towards something when a program fairly
terminates; in fact, when $\ell_c$ is executed, $\ell_b$ is a
candidate for unfair execution because it is enabled but not
executed. Thus we can put the $\ell_b$-hypothesis (that $\ell_b$ is
executed unfairly) on top of the T- and $\ell_a$-hypotheses. The
annotated program then becomes:

$$
\begin{array}{lll}
P4': & *[ & \boxed{\begin{array}{l} \ell_b \\ \ell_a : z \bmod 117 \\ T : \max\{y - x, 0\} \end{array}} \\
     &    & \ell_a:\ x < y \wedge z \bmod 117 = 0 \rightarrow x := x + 1 \quad \square \\
     &    & \ell_b:\ x < y \rightarrow z := z - 1 \quad \square \\
     &    & \ell_c:\ x < y \rightarrow \mathbf{skip}\ ]
\end{array}
$$

This annotation can be used as an argument why P4 fairly terminates
in a way similar to the previous arguments.

Note that if earlier methods involving recursive proof rules had been
used instead to show that P4 fairly terminates, it would have been
necessary to reason about three different programs: the original and
two syntactically derived programs.
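
The behaviour of P4 under fair and unfair scheduling can be observed with a short simulation. The following Python sketch is illustrative only: the scheduler functions `round_robin` and `always_skip` and the step bound `max_steps` are assumptions introduced here, not constructs from the paper. The round-robin scheduler offers each command a turn in order and falls back to $\ell_c$ when the command whose turn it is happens to be disabled.

```python
def run_p4(x, y, z, schedule, max_steps=100_000):
    """Run P4 with a scheduler `schedule(step, enabled) -> label`.

    Returns the final values of x, y, z and the number of steps taken;
    the run stops at termination (x >= y) or after max_steps steps.
    """
    steps = 0
    while x < y and steps < max_steps:
        enabled = ["b", "c"] + (["a"] if z % 117 == 0 else [])
        cmd = schedule(steps, enabled)
        assert cmd in enabled
        if cmd == "a":
            x += 1              # l_a: x := x + 1
        elif cmd == "b":
            z -= 1              # l_b: z := z - 1
        # l_c: skip
        steps += 1
    return x, y, z, steps


def round_robin(step, enabled):
    """Offer a, b, c in turn; execute skip (c) if the turn is disabled."""
    wanted = "abc"[step % 3]
    return wanted if wanted in enabled else "c"


def always_skip(step, enabled):
    """An unfair scheduler: l_b is always enabled but never executed."""
    return "c"


x, y, z, steps = run_p4(0, 2, 5, round_robin)
assert x == y                   # the fair run terminates

x, y, z, steps = run_p4(0, 2, 5, always_skip, max_steps=1000)
assert x == 0 and steps == 1000 # the unfair run makes no progress
```

The second run is the kind of computation excluded by strong fairness: both $\ell_a$-progress and termination stall because $\ell_b$ is enabled at every step yet never chosen.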

4  Verification  Conditions,  Soundness, 
and  Completeness 

In the preceding section we developed a notation for reasoning about
fairness. For each program we considered an arbitrary infinite
computation and argued, using the associated stacks, that the
computation was unfair. The rationale for using stack assertions,
however, is to avoid reasoning about infinite computations. So to
obtain an assertional verification method, we formulate verification
conditions for the stacks expressed by the assertions. When these
conditions are fulfilled for all transitions, we say that the stack
assertions define a fair termination measure. We show that if a
program has a fair termination measure, then it fairly terminates.
Thus the verification method of stack assertions is sound.

We also give a completeness result: when a program fairly terminates,
it has a fair termination measure. Moreover, we show that under
certain conditions (which are fulfilled if a history variable is
added to the program) the fair termination measure can be expressed
as program assertions in a reasonably powerful assertion language.

4.1  The  Verification  Conditions 

To formulate the verification conditions we need a few definitions. A
program P defines a transition relation $\rightarrow$ on a countable
set of program states; moreover, P defines a set of initial program
states and a finite set of commands. A command (or an action, a
process, an event, ...) is designated by a label $\ell$, and P
defines for each program state whether $\ell$ is enabled or disabled.
A transition $p \rightarrow p'$ describes the execution of exactly
one command, which is enabled in p. A path from $p_0$ to $p_n$ is a
sequence of states $p_0, \ldots, p_n$ such that
$p_i \rightarrow p_{i+1}$ for $i < n$; an infinite path
$p_0, p_1, \ldots$ is defined in a similar way. A computation is a
finite or infinite path starting in an initial state. A state p' is
reachable from state p if there is a path from p to p'. We assume
without loss of generality that any program state is reachable from
an initial state. (In practice, conventional assertional methods can
be used to describe the reachable program states, since finite
sequences of program states can be encoded as numbers; see [MP91].)

A progress hypothesis or $\alpha$-hypothesis is either an unfairness
hypothesis, of the form $\ell : w$ (with $\alpha = \ell$), or the
T-hypothesis, of the form $T : w$ (with $\alpha = T$), where w is an
element of a well-founded set $(W, \succ)$. A stack assignment is a
mapping $\mu$ that maps each program state p to a list $\mu(p)$ of
progress hypotheses such that the T-hypothesis is at level 0, i.e. at
the bottom. (It can be assumed that all the hypotheses are different,
i.e. there is at most one $\ell$-hypothesis in $\mu(p)$ for each
$\ell$.) The stack assertions of a program define a stack assignment
according to the semantics of the logical language of the assertions.
(The exact correspondence is of no importance here.) For a hypothesis
$\alpha : w$ in $\mu(p)$, where $\alpha$ is a label or "T," the value
w is called the $\alpha$-measure at p and is denoted $\mu^\alpha(p)$.

Note that the definitions above are not dependent on the particular
syntax of guarded commands, but depend only on the notions of
commands or actions being "enabled" and "executed." Thus our
soundness and completeness results apply to strong fairness in all
transition systems. For example, our method applies to nested
commands ("all-level fairness," see [Fra86]).

The verification conditions are expressed in terms of active and
non-invalidated hypotheses: essentially, an $\ell$-hypothesis is
active if progress towards unfair execution of $\ell$ is made, and it
is non-invalidated if $\ell$ is not executed. The T-hypothesis is
active if the program gets closer to termination; the T-hypothesis is
always considered non-invalidated.

The verification conditions can now be stated somewhat informally:

$(V_\mu)$ On any program transition,

• there is some active hypothesis;

• the active hypothesis and the ones below
are non-invalidated;

• and the stack does not change below the
active hypothesis.

The meaning of this is illustrated in Figure 1. Here the program
transition is $p \rightarrow p'$. The active hypothesis $\alpha$ is
at the same level in the stacks $\mu(p)$ and $\mu(p')$, and
everything below (denoted by S in the figure) remains unchanged.
Formally, the verification conditions $(V_\mu)$ are:

$(V_A)$ Some $\alpha$-hypothesis is active, i.e. either

• $\alpha$ is a label $\ell$ and command $\ell$ is enabled
(in state p or p'), or

• $w = \mu^\alpha(p)$ and $w' = \mu^\alpha(p')$ are defined
with $w \succ w'$.

$(V_{NonI})$ Every hypothesis below and including the
$\alpha$-hypothesis is non-invalidated, i.e. none of these
hypotheses is the $\ell$-hypothesis, where $\ell$ is the
command executed in going from p to p'.

$(V_{NoC})$ The stack does not change below hypothesis $\alpha$.

The contents of the stack above $\alpha$ may change in any way. When
the stack assignment $\mu$ satisfies these conditions for all program
transitions, we say that $(\mu, (W, \succ))$, or $\mu$ when the
well-founded relation $(W, \succ)$ is understood from the context, is
a fair termination measure.
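
The three formal conditions can be turned into a small checker for a single transition. The Python sketch below is one possible reading, under these assumptions: a stack is a list of `(hypothesis, measure)` pairs with level 0 first, a hypothesis without a measure carries `None`, the order on W is passed in as a predicate `greater`, and the function name `active_levels` is hypothetical. It returns every level that may serve as the active level.

```python
def active_levels(before, after, executed, enabled,
                  greater=lambda w, v: w > v):
    """Levels k at which (V_A), (V_NonI) and (V_NoC) hold together.

    before, after: stacks mu(p) and mu(p'), level 0 (T) first.
    executed:      label of the command executed on p -> p'.
    enabled:       set of labels enabled in p or in p'.
    """
    levels = []
    for k in range(min(len(before), len(after))):
        (h, w), (h2, w2) = before[k], after[k]
        if h != h2 or h == executed:
            break   # level k is replaced or invalidated: (V_NonI) fails here and above
        decreases = w is not None and w2 is not None and greater(w, w2)
        if h in enabled or decreases:
            levels.append(k)            # (V_A) holds at level k
        if (h, w) != (h2, w2):
            break   # (V_NoC) fails for every higher level
    return levels


# Stack of P4' in a state with x = 0, y = 3, z = 5, before and after l_b.
before = [("T", 3), ("a", 5), ("b", None)]
after = [("T", 3), ("a", 4), ("b", None)]
assert active_levels(before, after, "b", {"b", "c"}) == [1]
```

An empty result means the transition violates the verification conditions for the given stack assignment; a result with several entries reflects that more than one hypothesis may be chosen as active.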

4.2  Example 

Here is an argument explaining why the stack assertion of P4'
satisfies $(V_\mu)$. Consider an iteration not leading to
termination. There are three cases depending on which command is
executed:

$\ell_a$: The T-hypothesis is active, because
$\mu^T = \max\{y - x, 0\}$ decreases (since $x < y$ holds
before $\ell_a$ is executed). There is nothing beneath the
T-hypothesis to check.

$\ell_b$: Below the $\ell_a$-hypothesis, the stack remains unchanged
and the T-hypothesis is not invalidated. The
$\ell_a$-hypothesis is active, because $\ell_a$ is not
executed, and before execution of $\ell_b$, either
$z \equiv 0 \pmod{117}$ holds, i.e. $\ell_a$ is enabled, or
$z \not\equiv 0 \pmod{117}$ holds, i.e.
$\mu^{\ell_a} = z \bmod 117$ decreases.

$\ell_c$: The stack is unchanged below the $\ell_b$-hypothesis. The
$\ell_b$-hypothesis is active, because $\ell_b$ is enabled
but not executed. The $\ell_a$-hypothesis is
non-invalidated, because $\ell_a$ is not executed.

[Figure 1: Verification condition. The stacks $\mu(p)$ and $\mu(p')$
agree on the part S from level 0 up to the active level; at the
active level, $\alpha = \ell$ is enabled or $w \succ w'$; the command
executed is not among the hypotheses at or below the active level;
the part of the stack above the active level is irrelevant.]

4.3 Soundness

Theorem 1 (Soundness of Fair Termination Measures) If P has a fair
termination measure, then P fairly terminates.

(See the Appendix for all proofs.)

4.4 Completeness

Theorem 2 (Completeness of Fair Termination Measures) If P fairly
terminates, then P has a fair termination measure.

To prove Theorem 2, we first present a simple completeness proof,
which applies to programs that are tree-like. A program is tree-like
if it has a single initial state $p^0$ and if every state p', except
$p^0$, has exactly one predecessor, i.e. there is exactly one p such
that there is a transition $p \rightarrow p'$. Any program can be
made tree-like by adding a history variable recording the past
sequence of program states.

Theorem 3 If P fairly terminates and is tree-like, then P has a fair
termination measure.

To prove Theorem 2 for an arbitrary program P, we apply Theorem 3 to
the tree-like program P' that is obtained by adding a history
variable to P. The value of the progress measure for a state p of P
is then chosen as the least value of the progress measure of states
in P' that correspond to p; here "least" means least with respect to
a cross-product ordering on the progress measures of P'.

4.5 Relative Completeness

It is not hard to see that the completeness result in Theorem 3 can
be sharpened to show that an effectively represented fair termination
measure exists for an effectively represented program P (a program
that has a recursive transition relation² and a recursive function
that for all p' defines the state p (if it exists) such that
$p \rightarrow p'$). In fact, this measure can be obtained uniformly
from P. To see this we define a fair termination semi-measure
$(\mu, (W, \succ))$ to be a fair termination measure except that W
need not be well-founded; thus $\mu$ is just required to satisfy the
verification conditions.
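
The history-variable construction that makes a program tree-like, together with the predecessor function required of an effectively represented program, can be sketched as follows. This is an illustration only: the function name `unfold`, the representation of histories as tuples, and the depth bound are assumptions made here for the sake of a finite, runnable example.

```python
def unfold(initial, successors, depth):
    """Unfold a transition system into its tree of histories.

    A state of the unfolded program is the tuple of original states
    visited so far. The result maps each history to its unique
    predecessor history (None for the root), explored up to `depth`
    transitions from the initial state.
    """
    tree = {(initial,): None}
    frontier = [(initial,)]
    for _ in range(depth):
        next_frontier = []
        for history in frontier:
            for state in successors(history[-1]):
                child = history + (state,)
                tree[child] = history        # the only predecessor of child
                next_frontier.append(child)
        frontier = next_frontier
    return tree


# State 0 loops to itself or moves to the final state 1; the original
# program is not tree-like because 0 is its own predecessor.
tree = unfold(0, lambda s: [0, 1] if s == 0 else [], depth=2)
assert tree[(0, 0, 1)] == (0, 0)
```

In the unfolded program the predecessor of a history is obtained simply by dropping its last component, which is why the predecessor function is recursive whenever the original transition relation is.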


Theorem 4 There is a recursive function h that given an index for a
tree-like program P gives indices for a fair termination semi-measure
$(\mu, (W, \succ))$, where both $\mu$ and $(W, \succ)$ are recursive.
Moreover, $(\mu, (W, \succ))$ is a fair termination measure (i.e.
$(W, \succ)$ is well-founded) iff P is fairly terminating.

This theorem gives an explicit reduction of the fair termination
problem to a classical $\Pi^1_1$-complete problem of whether a
recursive relation is well-founded. Moreover, it shows that if the
assertional language includes usual predicate logic on numbers (and
therefore all relations in the arithmetic hierarchy by a fundamental
result of Gödel, see [Rog67]), then there exists a stack assertion

$$
\boxed{\begin{array}{l} \alpha_N : w_N \\ \quad\vdots \\ \alpha_1 : w_1 \\ T : w_T \end{array}}
$$

where the $\alpha$'s and $w$'s are definable in the assertional
logic, that satisfies the verification conditions.

²A recursive relation is also called a recursively computable
relation, see [Rog67].

Thus  we  obtain: 

Corollary  1  (Relative  Completeness  of  Stack 
Assertions)  If  the  assertional  language  contains 
predicate  calculus  and  if  P  fairly  terminates,  then  P 
can  be  annotated  with  stack  assertions  in  terms  of  the 
program  history  such  that  the  verification  conditions 
are  satisfied. 

5  Discussion 

The  results  presented  here  are  related  to  the  method 
of  helpful  directions  [Fra86,  GFMdRvSS,  LPS81]  and 
the  Rabin  measures  of  [KK91]. 

Formulated  in  our  terminology,  the  method  of  helpful 
directions  is  used  to  identify  one  level  of  the  fair  ter¬ 
mination  measure  at  a  time.  For  example,  one  first 
identifies  subsets  of  program  states  corresponding  to 
a  constant  measure.  Then  the  program  is  trans¬ 
formed  into  several  new  programs,  each  corresponding 
to  a  subset.  The  states  of  each  derived  program  are 
then  further  partitioned  according  to  unfairness  hy¬ 
pothesis  (helpful  directions)  of  the  first  level  to  yield 
more  subsets,  which  are  expressed  as  more  derived 
programs. 

Our approach is also related to the Rabin progress measures of
[KK91, Kla90]. A Rabin progress measure is defined as a mapping from
the program states
into  a  colored  tree.  This  mapping  can  be  described  in 
program  assertions  by  specifying  the  progress  values 
for  each  program  state.  The  problem  is  that  the  col¬ 
ored  tree  has  to  be  explicitly  described  (as  it  was  done 
in  an  example  given  in  [KK91]).  In  contrast,  the  stack 
assertions  given  in  this  paper  are  self-contained. 

There  are  some  technical  differences  that  have  been 
introduced  to  make  stack  assertions  more  useful  for 
program  annotation: 

•  Two  stacks  may  contain  the  same  progress  values, 
but  be  colored  differently.  In  a  Rabin  progress 
measure  the  coloring  is  a  function  of  the  progress 
values.  Thus  it  is  not  possible  to  translate  directly 
a  fair  termination  measure  into  a  Rabin  progress 
measure. 

•  For  a  Rabin  progress  measure,  satisfaction  of  an 
enabling  condition  is  expressed  in  terms  of  the 
new  state.  For  stack  assertions,  the  satisfaction 
of  the  enabling  condition  is  considered  in  terms 
of  the  old  state  and  the  new  state. 

• There may be several choices for an active hypothesis. For Rabin
progress measures the active hypothesis is uniquely determined
for each transition.

Appendix: Proofs

Proof  of  Theorem  1 

Assume that P has a fair termination measure $\mu$ and that
$p_0, p_1, \ldots$ is an infinite computation. We must prove that
$p_0, p_1, \ldots$ is unfair. To see this we let $\kappa_i$ be the
level of the active hypothesis of the transition
$p_i \rightarrow p_{i+1}$ and we define
$\kappa = \liminf_{i \to \infty} \kappa_i$, i.e. $\kappa$ is the
least value of $\kappa_i$ that occurs infinitely often. Then from
some point on $\kappa_i$ is always at least $\kappa$, i.e. there is a
K such that for all $i \geq K$, $\kappa_i \geq \kappa$.

It is not hard to see that $\kappa > 0$; in fact, if $\kappa$ was 0,
then the values of the T-measure would form a sequence
$\mu^T(p_0) \succeq \mu^T(p_1) \succeq \cdots$ (by $(V_A)$ and
$(V_{NoC})$),³ where infinitely often the inequality is strict,
namely each time $\kappa_i = 0$. This contradicts that $(W, \succ)$
is well-founded.

Thus $\kappa > 0$ and there is an $\ell$ such that for all
$i \geq K$, the hypothesis at level $\kappa$ is an
$\ell$-hypothesis (by $(V_{NoC})$) and this hypothesis is
non-invalidated (by $(V_{NonI})$). It follows that $\ell$ is executed
only finitely often. To see that the computation is unfair with
respect to $\ell$, we now only have to prove that $\ell$ is enabled
infinitely often.

Assume the opposite is true. Thus for some $H \geq K$, it holds for
all $i \geq H$ that $\ell$ is not enabled and the $\ell$-hypothesis
has the form $\ell : w_i$. As indicated in Figure 2, the values
$w_i = \mu^\ell(p_i)$, $i \geq H$, give rise to an infinite
descending sequence in W, because
$w_H \succeq w_{H+1} \succeq \cdots$ (by $(V_A)$ and $(V_{NonI})$)
with strict inequalities whenever $\kappa_i = \kappa$. This
contradicts that $(W, \succ)$ is well-founded.

³$w \succeq w'$ means that $w = w'$ or $w \succ w'$.

[Figure 2: Soundness.]

[Figure 3: Initial stack.]

Proof  of  Theorem  3 

Assume that P fairly terminates and that there are N different
commands. The proof is by a construction that defines the stack
$\mu(p')$ in terms of the stack $\mu(p)$ when there is a transition
$p \rightarrow p'$. Because the program is tree-like, this
construction will define a unique value of $\mu$ for all p. The
progress measures of the hypotheses take on values in a countable set
W equipped with a relation $\succ$. Both W and $\succ$ are initially
empty. The stack of $p^0$ is as illustrated in Figure 3. Here we
created at levels 1 to N a hypothesis for each command. The order of
the hypotheses does not matter at this point. Each instance of new
means that a new element is added to W. Hence creating the stack
$\mu(p^0)$ results in there being $N + 1$ elements in W, whereas
$\succ$ remains empty.

When we create the stack $\mu(p)$ and use new to create a new element
w at level $\kappa$, we define $\iota(w) = p$ and
$\lambda(w) = \kappa$. Thus $\iota(w)$ denotes the program state
where w is first used, and $\lambda(w)$ denotes the level where w is
used.

Now assume that $\mu(p)$ has been defined and that there is a
transition $p \rightarrow p'$ with $\ell$ denoting the command that
is being executed. The idea behind the construction of $\mu(p')$ is
to keep as much of $\mu(p)$ as possible. To state this more precisely
we say that an $\ell'$-hypothesis in $\mu(p)$ is naturally active if
$\ell'$ is enabled in p or p' and the $\ell'$-hypothesis is below the
$\ell$-hypothesis.

Case 1 If there is a naturally active hypothesis, let $\alpha$ be the
naturally active hypothesis at the lowest level. The new stack
becomes as illustrated in Figure 4. Here everything below $\alpha$,
indicated by S, is preserved. Also, the hypotheses above S are
preserved, but their measures all change to new values.

Case 2 If there is no naturally active hypothesis, we let $\alpha$ be
such that the $\alpha$-hypothesis is the one just below the
$\ell$-hypothesis. Note that it may happen that $\alpha = T$. The
$\alpha$-measure takes on a new value w', and we add $w \succ w'$ to
the relation $\succ$ and say that $\alpha$ is forced active; in
addition, the hypotheses above $\alpha$ are rotated one step
downwards. Note that $\ell$ is moved upwards (unless there is only
one unfairness hypothesis in the stack) and that it is the only
hypothesis moved upwards.
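
The two cases of the construction can be sketched in Python. This is a minimal sketch under stated assumptions, not the construction as drawn in Figures 4 and 5 (which are not reproduced here): the names `Fresh`, `initial_stack` and `next_stack` are hypothetical, progress values are represented by consecutive integers, and in Case 2 the hypotheses above $\alpha$ are assumed to receive new measure values as they do in Case 1.

```python
import itertools


class Fresh:
    """Source of new elements of W plus the recorded relation on W."""

    def __init__(self):
        self._counter = itertools.count()
        self.order = set()          # pairs (w, w2) meaning w ≻ w2

    def new(self):
        return next(self._counter)


def initial_stack(commands, fresh):
    """Stack of the initial state: T at level 0, one hypothesis per command."""
    return [("T", fresh.new())] + [(c, fresh.new()) for c in commands]


def next_stack(stack, executed, enabled, fresh):
    """Build the stack after a transition executing `executed`.

    `enabled` is the set of labels enabled before or after the transition.
    Returns (new_stack, active_level, kind).
    """
    labels = [h for h, _ in stack]
    pos = labels.index(executed)
    natural = [k for k in range(1, pos) if labels[k] in enabled]
    if natural:                                     # Case 1
        k = natural[0]                              # lowest naturally active level
        new = stack[:k] + [(h, fresh.new()) for h in labels[k:]]
        return new, k, "natural"
    k = pos - 1                                     # Case 2: alpha is just below l
    alpha, w = stack[k]
    w2 = fresh.new()
    fresh.order.add((w, w2))                        # record w ≻ w2
    rotated = labels[pos + 1:] + [executed]         # l moves to the top
    new = stack[:k] + [(alpha, w2)] + [(h, fresh.new()) for h in rotated]
    return new, k, "forced"


fresh = Fresh()
s0 = initial_stack(["a", "b"], fresh)
s1, level, kind = next_stack(s0, "b", {"a", "b"}, fresh)
assert (level, kind) == (1, "natural")
```

In both cases the part of the stack below the returned level is copied unchanged and the executed command sits strictly above that level, which is what the three verification conditions require.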

Whether $\mu(p')$ is constructed according to Case 1 or Case 2 above,
the requirements $(V_A)$, $(V_{NonI})$, and $(V_{NoC})$ can be seen
to be satisfied for the transition $p \rightarrow p'$. Note also that
when $\alpha$ is an active hypothesis, then there are no hypotheses
below that can be active. Thus to finish the completeness proof we
only need to show that $(W, \succ)$ is well-founded. We use the
following properties:

[Figure 4: New stack in Case 1.]

[Figure 5: New stack in Case 2.]

Claim 1 If $p \rightarrow p'$, $\iota(w) \neq p'$, and
$\mu^\alpha(p') = w$, then $\mu^\alpha(p) = w$ and the position of
the $\alpha$-hypothesis did not change on $p \rightarrow p'$.
Moreover, if $\alpha$ is a label $\ell$, then $\ell$ is not enabled
and not executed on $p \rightarrow p'$. Also, the hypothesis just
above the $\alpha$-hypothesis in $\mu(p)$ does not change position
and it is non-invalidated on $p \rightarrow p'$.

Proof By considering Case 1 and Case 2 above. □

Claim 2 Let w and w' be elements of W such that $w \succ w'$ and let
$\kappa = \lambda(w)$. Let $\alpha$ be the hypothesis at level
$\kappa$ in $\mu(\iota(w))$.

(a) There is a path $\mathcal{P}^{w,w'} = p_0, \ldots, p_n$ with
$p_0 = \iota(w)$ and $p_n = \iota(w')$ such that the active level for
$p_i \rightarrow p_{i+1}$ is greater than $\kappa$ for $i < n - 1$,
and such that for $p_{n-1} \rightarrow p_n$ the hypothesis $\alpha$
is forced active.

(b) Moreover, no command of a hypothesis at or below $\kappa$ is
enabled along $\mathcal{P}^{w,w'}$.

Proof (a) By Claim 1, every p such that $\mu^\alpha(p) = w$ is
reachable from $\iota(w)$ along a path where

• $\alpha$ is at level $\kappa$,

• if $\alpha = \ell \neq T$, then $\ell$ is not executed and $\ell$
is not enabled, and

• $\mu^\alpha$ has the constant value w.

Since P is tree-like, there is a unique path $p_0, \ldots, p_{n-1}$
with $p_0 = \iota(w)$ such that there is a transition
$p_{n-1} \rightarrow p_n = \iota(w')$, where $\mu(p_n)$ is
constructed according to Case 2 and $\alpha$ is the hypothesis forced
active.

(b) This follows from the choice of active hypothesis in Case 1 and
Case 2. □

Now assume that there is an infinite descending sequence
$w_0 \succ w_1 \succ \cdots$ in W. By (a) of Claim 2, an infinite
path $\mathcal{P}$ containing $\iota(w_0), \iota(w_1), \ldots$ can be
put together from the paths $\mathcal{P}^{w_i, w_{i+1}}$. Along this
path the active level is always at least
$\kappa = \lambda(w_0) = \lambda(w_1) = \cdots$. Let $\alpha$ be the
hypothesis at level $\kappa$. The commands that are executed
infinitely often are above $\alpha$, since $(V_{NonI})$ is satisfied.
Also any command $\ell'$ that is executed only finitely often is
eventually at level $\kappa$ or below, because from the point where
$\ell'$ is no longer executed, the $\ell'$-hypothesis can only move
downwards in the stack and will eventually settle at some level; this
level is at most $\kappa$, because the hypotheses above $\kappa$ are
rotated infinitely often, namely each time $\alpha$ is forced active.

By the assumption that P fairly terminates, there is a command $\ell$
that is executed finitely often and enabled infinitely often. By the
previous argument, the $\ell$-hypothesis is at level $\kappa$ or
below. But the $\ell$-hypothesis being infinitely often enabled then
contradicts (b) of Claim 2. Hence there are no infinite descending
sequences in W, i.e. $(W, \succ)$ is well-founded. □


Proof  of  Theorem  2 

The idea of the proof is similar to the use of the Sewing Lemma in
[Kla92] for the immediate determinacy result of certain infinite
games.

Assume that P fairly terminates. Also assume that there is a function
C such that on any transition $p \rightarrow p'$, the value $C(p')$
denotes the command executed in going from p to p' (the program state
space and transition relation can always be extended to contain this
information). By adding a history variable to P, we obtain a program
$\bar{P}$, which also fairly terminates. A state of $\bar{P}$ is of
the form $\sigma = (p_1, \ldots, p_n)$ and the transitions of
$\bar{P}$ are of the form
$(p_1, \ldots, p_n) \rightarrow (p_1, \ldots, p_{n+1})$, where
$p_n \rightarrow p_{n+1}$ is a transition of P. The initial state of
$\bar{P}$ is $(p^0)$, where $p^0$ is the initial state of P. For
$\sigma = (p_1, \ldots, p_n)$ define $\rho\sigma = p_n$.

The set of states of $\bar{P}$ form a tree with root $(p^0)$. If
$p \rightarrow p'$ and $\rho\sigma = p$, then the state
$\sigma \cdot p'$ (which is the list obtained by appending p' to the
right end of $\sigma$) is a child of $\sigma$. A state $\sigma$ is an
ancestor of a state $\sigma'$ if there are
$\sigma_0, \ldots, \sigma_n$ such that $\sigma_{i+1}$ is a child of
$\sigma_i$ for $i < n$ and $\sigma_0 = \sigma$ and
$\sigma_n = \sigma'$. Define $C(\sigma) = C(\rho\sigma)$. Let
$\bar{\mu}$ designate the fair termination measure given by the
completeness proof of Theorem 3. The mapping $\bar{\mu}$ can be
specified by a mapping $\bar{a}$ that to each $\sigma$ associates a
list $\bar{a}(\sigma) = (T, \ell_1, \ldots, \ell_N)$ specifying the
ordering of the hypotheses in the stack $\bar{\mu}(\sigma)$ and by a
mapping $\bar{\theta} : \bar{P} \rightarrow W^{N+1}$ specifying for
each $\sigma$ a list $w = (w_0, \ldots, w_N)$ denoting the values of
the progress measures at the $N + 1$ levels of the stack. For a list
w, the i-th component is denoted $w[i]$ and the sublist consisting of
components from i to j is denoted $w[i..j]$. We may assume that
$(W, \succ)$ is totally ordered, i.e. is a well-ordering. We define
an ordering, also denoted $\succ$, on $W^{N+1}$ by $w \succ w'$ if
for some i, $w[i] \succ w'[i]$, and for all $j < i$,
$w[j] = w'[j]$. Then $\succ$ is a well-ordering. Now define
$\theta(p) = \bar{\theta}(\sigma)$ and $a(p) = \bar{a}(\sigma)$,
where $\sigma$ is chosen such that $\rho\sigma = p$ and
$\bar{\theta}(\sigma)$ is minimal with respect to $\succ$.

Claim 3 If $\bar{\theta}(\sigma)[0..n] = \bar{\theta}(\sigma')[0..n]$,
then $\bar{a}(\sigma)[0..n+1] = \bar{a}(\sigma')[0..n+1]$.

Proof This follows from Claim 1. □

For $w, w' \in W^{N+1}$, define $|w, w'| = h$, where h is maximal
such that for all $j \leq h$, $w[j] = w'[j]$.
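
The lexicographic ordering on lists of progress values and the agreement index $|w, w'|$ can be illustrated with a short Python sketch. The names `lex_greater` and `agree_up_to` are hypothetical; integers with their usual order stand in for $(W, \succ)$; and the convention that the index is $-1$ when the lists already differ at level 0 is an assumption made here so that the first differing level is always one above the returned value.

```python
def lex_greater(w, v, greater=lambda a, b: a > b):
    """w ≻ v in the lexicographic extension of `greater`:
    w is strictly greater at the first index where the lists differ."""
    for a, b in zip(w, v):
        if a != b:
            return greater(a, b)
    return False


def agree_up_to(w, v):
    """Largest h with w[0..h] == v[0..h]; -1 if w[0] != v[0]."""
    h = -1
    for a, b in zip(w, v):
        if a != b:
            break
        h += 1
    return h


w, v = [1, 5, 0], [1, 3, 9]
assert lex_greater(w, v) and not lex_greater(v, w)
assert agree_up_to(w, v) == 0      # the lists first differ at level 1
```

With this convention the level `agree_up_to(w, v) + 1` is the candidate active level used in the case analysis of the proof.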

Now consider a transition $p \rightarrow p'$ of P and let us prove
that there is an $\alpha$-hypothesis such that $(V_A)$, $(V_{NonI})$,
and $(V_{NoC})$ are fulfilled. Let $w = \theta(p)$ and
$w' = \theta(p')$. Then $w = \bar{\theta}(\sigma)$ for some $\sigma$
such that $\rho\sigma = p$. Let
$w'' = \bar{\theta}(\sigma \cdot p')$. By definition of
$\theta(p')$, $w'' \succeq w'$. Also, let $\sigma'$ be such that
$\bar{\theta}(\sigma' \cdot p') = \theta(p')$.

There are two cases:

Case $w \succ w'$: Let $h = |w, w'|$. By Claim 3,
$a(p)[0..h+1] = a(p')[0..h+1]$. Thus $a(p)[h+1]$ is active and the
stack below is unchanged, whence $(V_A)$ and $(V_{NoC})$ are
satisfied.

By definition of h,
$\bar{\theta}(\sigma' \cdot p')[0..h] = \bar{\theta}(\sigma)[0..h]$.
Thus the values in $\bar{\theta}(\sigma' \cdot p')[0..h]$ are created
in an ancestor of $\sigma' \cdot p'$ and therefore
$\bar{\theta}(\sigma')[0..h] = \bar{\theta}(\sigma' \cdot p')[0..h]$
by Claim 1. Also by Claim 1, it follows that
$C(p') = C(\sigma' \cdot p')$, the command executed on
$p \rightarrow p'$, is not among
$\bar{a}(\sigma' \cdot p')[0..h+1] = a(p')[0..h+1]$, whence
$(V_{NonI})$ is satisfied.

Case $w \preceq w'$: We have $w'' \succeq w' \succeq w$. Let
$h = |w, w''|$. By construction of $\bar{\mu}$, the hypothesis at
level $h + 1$ is naturally active for the transition
$\sigma \rightarrow \sigma \cdot p'$ of $\bar{P}$. But since
$w'' \succeq w' \succeq w$, it can be seen that
$w''[0..h] = w'[0..h] = w[0..h]$. It follows that
$a(p)[0..h+1] = a(p')[0..h+1]$ and that the $a(p)[h+1]$-hypothesis
is naturally active for $p \rightarrow p'$ of P, whence $(V_A)$ and
$(V_{NoC})$ are satisfied. It can be seen in the same manner as in
the previous case that $(V_{NonI})$ is also satisfied. □


Given an effectively represented program P and a program
state p, it is possible to calculate the sequence of
program states, starting at the initial state, that leads
to p. Thus the tree can be effectively traversed (even if
it is infinitely branching). This traversal ensures that
each time “new” is invoked, a unique progress value
is returned. For example, we can represent W using
the natural numbers; successive invocations of “new”
then give progress values ‘0,’ ‘1,’ ... Note that the relation
>- calculated on W is not the usual ordering on
the natural numbers. Given a state p, the tree is traversed
until p is encountered. At any program state,
the value of the stack is calculated according to the
procedure given in the proof of Theorem 3. It follows
that the value of the fair semi-measure at p can
be recursively calculated. Similarly the relation i >- j
can be seen to be recursive. By standard techniques
of computability theory, the above procedure can be
expressed formally as a recursive function A satisfying
the properties in the statement of the Theorem. □


Proof  of  Theorem  4 
Abstract

We study randomized, synchronous protocols for coordinated
attack. Such protocols trade off the number
of rounds (N), the worst case probability of disagreement
(U), and the probability that all generals attack
($\mathcal{L}$). We prove a nearly tight bound on the tradeoff
between $\mathcal{L}$ and U ($\mathcal{L}/U \le N$) for a strong adversary
that destroys any subset of messages. Our techniques
may be useful for other problems that allow a nonzero
probability of disagreement.

1  Introduction 

Suppose two computers are trying to perform a
database transaction over an unreliable telephone
line. If the line goes dead at some crucial point, standard
database protocols mark the transaction status
as “uncertain” and wait until communication is restored
to update its status. The protocol will ensure
that the two computers eventually agree if communication
is eventually restored.

On the other hand, suppose that the transaction
has a real time constraint (e.g., a decision to commit
or reject the transaction must be reached in 10
minutes) and the cost of disagreement is high. Then
standard commit protocols do not work. If communication
can fail for up to ten minutes it is always
possible for the two computers to disagree. Is there
a protocol that prevents disagreement in all cases?

The answer is no. The question was first formalized
in [G] as the coordinated attack problem. In
this problem, there are two generals who communicate
only using unreliable messengers. The generals
are initially passive; however, at any instant either
general may get an input signal that instructs him
to try to attack a distant fort. The generals have a
common clock. The problem is to synchronize attack
attempts subject to the conditions:

•  Validity: If no input signal arrives, neither general
attacks.^1

•  Agreement: Either both generals attack or
they both do not attack.

•  Nontriviality: There is at least one execution
of the protocol in which both generals attack.

It is shown in ([G], [HM]) that there is no deterministic
algorithm that meets all three conditions.
In this paper, we consider a generalization to an arbitrary
number of generals connected by a graph of
unreliable links. Clearly the impossibility result applies
here as well.

Coordinated attack (CA) looks suspiciously like
Byzantine agreement (BA) [LPS]. The major differences
are: first, in BA, generals exhibit arbitrary failures
while in CA only links fail by destroying messages;
second, in BA only some fraction of the generals
are assumed to be faulty while in CA all links can
be faulty. Thus there does not appear to be any way
to reduce CA to BA or vice versa.

There is a well-known history of randomization providing
a cure for a deterministic impossibility result
(e.g. [RL], [B]). Thus we turn to randomized CA.
We hope to trade a small probability of disagreement
when links fail for a high probability of agreement (on
a positive outcome) when links do not fail.

^1 Another validity condition that is often used is that if
no messages are delivered, then no general attacks. We prefer
our definition because it focuses on input-output behavior.
However, our results can be modified to fit the other validity
condition.

We modify the correctness conditions for deterministic
CA to fit randomized CA. We retain the validity
condition. We modify the agreement condition by
requiring that the worst case probability of disagreement
(denoted by U) be smaller than $\epsilon$, a parameter.
We replace the nontriviality condition by a measure
$\mathcal{L}(R)$ (for liveness) that measures the probability that all
generals attack after an input signal, given that messages
are delivered according to a given pattern R.^2
We measure the goodness of a CA protocol by seeing
how high $\mathcal{L}(R)$ can be for a given R and $\epsilon$.

Coordinated attack captures the fundamental difficulty
of real-time synchronization over unreliable
message channels. This paper investigates whether
randomization can help coordinated attack. Our answer
is basically no for nontrivial adversaries, and a
qualified yes for much weaker adversaries. Our paper
concentrates on a strong adversary that can deliver
messages according to any possible pattern R but has
no access to message bits.^3

The rest of this paper is organized as follows. Section
2 contains our model. Section 3 describes a simple
but inefficient protocol, and Section 4 introduces
some useful concepts. Section 5 contains a basic
lower bound, Section 6 describes an optimal protocol
against a strong adversary, and Section 7 contains a
second, more refined, lower bound. Section 8 contains
our conclusions and the appendix contains a proof of
the second lower bound.

2  Model 

The generals are represented by processes that are
at the vertices of an undirected graph $G(E, V)$ with
$V = \{1, \ldots, m\}$, $m \ge 2$. We consider synchronous
protocols that work in $N + 2$ rounds, numbered
$-1, 0, \ldots, N$. We model the input as a message
sent at the end of a fictitious Round -1 and arriving
at the end of Round 0 from a fictitious “environment”
node $v_0$. We assume $v_0 \notin V$. Informally, if
a process i receives a message in Round 0 from $v_0$, it
has received a signal to try to attack. Each process
i also receives a sequence of J random bits called $\alpha_i$.
J is an upper bound on the total number of random
bits used by any general.

^2 It may seem strange that unsafety is measured as the worst
case across all runs while liveness is measured separately for
each run. However, the situation is similar to Data Link protocols
in which the prefix property (safety) is always preserved
but liveness is guaranteed only if the channel is delivering
messages.

^3 Since our lower bounds are pessimistic, there is no point
in considering a stronger adversary that can read message bits.
Also, some form of encryption could be used to make this assumption
reasonable.

A protocol F consists of a number of local protocols
$F_i$. Each $F_i$, $i \in V$, is a state machine executed by
process i. $F_i$ has two possible start states $s_i^0$ and $s_i^1$, a
state transition function $\delta_i$, and a message generation
function $\sigma_i$. Let $S_i^r$ be the set of messages received by
i from its neighbors in round r. Let $q_i^r$ be the state of
i at the end of round r. Then $q_i^r = \delta_i(q_i^{r-1}, r, S_i^r, \alpha_i)$.
We assume without loss of generality that processes
send messages to each neighbor in rounds 1...N since
we can always simulate algorithms in which this is not
true by sending null messages that are ignored by the
receiver. Let $m_{ij}^r$ be the message sent by i to neighbor
j in round r. Then $m_{ij}^r = \sigma_i(q_i^{r-1}, j)$. At the end of
N rounds, i outputs a bit $O_i$ based on $q_i^N$. $O_i = 1$ iff
i decides to attack.

An execution of F is described in terms of a vector
of local executions. A local execution $E_i$ consists of
$q_i^r$ for $1 \le r \le N$, and $O_i$. To generate
an execution of F we need to define a run that
represents the inputs as well as which messages get
through in rounds 1...N of the protocol. Formally,
a run $R = I(R) \cup M(R)$. $I(R)$, the input for run R, is
an arbitrary subset of $\{(v_0, i, 0) : i \in V\}$. $M(R)$, the
messages delivered in run R, is an arbitrary subset of
$\{(i, j, r) : (i, j) \in E, 1 \le r \le N\}$. For example, in the
run $\{(v_0, 3, 0), (1, 2, 6), (3, 2, 7)\}$ only $F_3$ receives a signal
to attack. Also only the message sent in Round
6 from $F_1$ to $F_2$ and any message sent in Round 7
from $F_3$ to $F_2$ are delivered: all other sent messages
are lost.

We will use the notation $(A_i)$ to denote a vector
A consisting of a component $A_i$ for each $i \in V$.
An execution for a fixed F is uniquely specified by
random input $\alpha = (\alpha_i)$, and a run R. We define
$Ex(R, \alpha) = (E_i)$ as the execution generated by R and
$\alpha$ for a fixed protocol F. Each $E_i$ is a local execution
such that:

•  If $(v_0, i, 0) \notin R$ then $q_i^0 = s_i^0$. If $(v_0, i, 0) \in R$
then $q_i^0 = s_i^1$ (i.e., the initial state of the local
execution encodes the input).

•  For all r, $1 \le r \le N$: $m_{ij}^r = \sigma_i(q_i^{r-1}, j)$.

•  For all r, $1 \le r \le N$: $m_{ji}^r \in S_i^r$ iff $(j, i, r) \in R$.

•  For all r, $1 \le r \le N$: $q_i^r = \delta_i(q_i^{r-1}, r, S_i^r, \alpha_i)$.

The output of execution E is the vector $(O_i(q_i^N))$.
We say two executions $E$ and $\tilde{E}$ are identical to i if
$E_i = \tilde{E}_i$.
We consider sets of executions of a particular protocol.
If X and Y are sets of executions, then XY
denotes $X \cap Y$, and $X + Y$ denotes $X \cup Y$. $D_i$ denotes
the set of executions in which $O_i(q_i^N) = 1$, and
$\bar{D}_i$ the set of executions in which $O_i(q_i^N) = 0$. Similarly,
$(D_i \mid R)$ denotes the set of executions that have
run R and in which $O_i(q_i^N) = 1$.

TA (total attack) denotes the set of executions
$D_1 D_2 \ldots D_m$. NA (no attack) denotes the set of executions
$\bar{D}_1 \bar{D}_2 \ldots \bar{D}_m$. PA (partial attack) denotes
the complement of $NA \cup TA$. Thus, TA is the set of
executions in which all processes agree on an output
of 1, NA is the set of executions in which all processes
agree on an output of 0, and PA is the set of
executions in which some pair of processes disagree.

Each $\alpha_i$ is drawn from $\{0, 1\}^J$ using the uniform
probability distribution. This probability distribution
on inputs $\alpha$ induces a probability distribution
on executions for each possible run R, in the natural
way. For each set X of executions and each run R, we
use the notation $\Pr[X \mid R]$ to denote the probability of
event X according to this distribution of executions.

Now consider two runs $R = \{(i, j, 1)\}$ and $\tilde{R} = \emptyset$.
The only difference in the runs is that i sends a
message that is delivered in R. Thus, given the same
random input, i will decide the same regardless of
whether an execution follows run $R$ or run $\tilde{R}$. This
leads to a key notion of indistinguishable runs. We
say that two runs $R$ and $\tilde{R}$ are indistinguishable to
i if for all $\alpha$, $Ex(R, \alpha)$ and $Ex(\tilde{R}, \alpha)$ are identical
to i. We use $R \stackrel{i}{=} \tilde{R}$ to denote that $R$ and $\tilde{R}$ are
indistinguishable to i. A natural consequence is:

Lemma 2.1 If $R \stackrel{i}{=} \tilde{R}$ then $\Pr[D_i \mid R] = \Pr[D_i \mid \tilde{R}]$.

An adversary is a set of runs. We will only deal
in this paper with a strong adversary, $A_s$, where $A_s$
is the set of all possible runs.

Next, we describe the correctness conditions and
the liveness measure. Validity requires that no process
attacks if there is no input. Agreement requires
that the worst-case probability of partial attack be no
more than $\epsilon$, a parameter. Finally the liveness measure
for a run R is the probability of total attack on
run R.

•  Validity: A protocol satisfies validity if for all
vectors $\alpha$, for all R such that $I(R) = \emptyset$, and for
all i: $O_i = 0$ in $Ex(R, \alpha)$.

•  Agreement: We define $U_A(F)$, the unsafety of
protocol F against adversary A, as: $U_A(F) =
\max_{R \in A} \Pr[PA \mid R]$. Then F satisfies agreement
with parameter $\epsilon$ if $U_A(F) \le \epsilon$.

•  Liveness: We define liveness $\mathcal{L}(F, R)$ of protocol
F on run R by: $\mathcal{L}(F, R) = \Pr[TA \mid R]$.

Our goal is to find an “optimal” algorithm F that
meets the validity and agreement conditions, and
such that $\mathcal{L}(F, R)$ is as large as possible for any run
R. We end this section with two elementary lemmas
on which our lower bounds are based. The first
states that the unsafety is at least as large as the difference
in attack probabilities of any two processes.
The second states that the liveness is no more than
the attack probability of any process. The two inequalities
given below do not seem very tight, and so
it is perhaps surprising that the lower bounds based
on these inequalities are as tight as they are.

Lemma 2.2 For all $i, j \in V$, $\Pr[D_i \mid R] - \Pr[D_j \mid R] \le
U_s(F)$.

Lemma 2.3 For all $i \in V$, $\mathcal{L}(F, R) \le \Pr[D_i \mid R]$.
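The text states these two lemmas without proof. One way to see them, offered here as a sketch from the definitions of TA, NA and PA rather than as the authors' argument, is that any execution in which i attacks and j does not lies in PA, and every execution in TA lies in $D_i$:

```latex
\begin{align*}
\Pr[D_i \mid R]
  &= \Pr[D_i D_j \mid R] + \Pr[D_i \bar{D}_j \mid R] \\
  &\le \Pr[D_j \mid R] + \Pr[PA \mid R]
   \le \Pr[D_j \mid R] + U_s(F), \\
\mathcal{L}(F, R)
  &= \Pr[TA \mid R] \le \Pr[D_i \mid R].
\end{align*}
```

The last step of the first chain uses that the strong adversary contains every run, so $\Pr[PA \mid R]$ is at most the maximum that defines $U_s(F)$.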

3  Example  Protocol 

We informally describe a simple protocol A for two
processes 1 and 2 against a strong adversary. The limitations
of this protocol will motivate both the lower
bound in Section 5 and the optimal protocol of Section 6.

In order to conform to the model, we require that
each process must send some message (at least a null
message) in every round. For convenience, let us call
a non-null message (i.e., a message that carries information)
a packet. We assume implicitly that on
every round a process sends either a packet or a null
message.

Initially, at the start of round 0, process 1 chooses
a random integer rfire that is uniformly distributed
between 2 and N. Process 1 includes the value of
rfire in any packet it sends. If process 2 receives any
packet from process 1, process 2 will store the value
of rfire.

In rounds 1 through N, the two processes send
packets to each other in alternate rounds. Process
2 is allowed to send packets in odd rounds starting
from round 1, while process 1 is allowed to send packets
in even rounds. The protocol begins with process
2 sending a packet in round 1. However, in all later
rounds, a process sends a packet in a round only if
it has received a packet in the previous round, and
it is allowed to send a packet in the round. Thus if
the adversary destroys a packet sent in round r, all
packet sending stops in rounds greater than r.

The main idea is that if all packets sent strictly
before round number rfire have been delivered, then
the process that received the last packet (say i) will
decide to attack. If the next packet sent by process i
is delivered then the other process (say j) will also decide
to attack. On the other hand, if any packet sent
before round rfire is destroyed, then both processes
stop sending packets and do not attack. Since the adversary
that controls message delivery does not know
the value of rfire, the adversary has only a chance of
approximately 1/N of causing partial attack. This is
because the adversary can cause partial attack only
if the first packet destroyed in the run is the packet
sent in round rfire. Thus $U_s(A) \approx 1/N$.
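The counting argument above can be checked with a small sketch. It assumes both processes have valid input and summarizes a run by a single number, the round of the first destroyed packet; `outcome` and `outcome_probabilities` are hypothetical helper names, not part of protocol A itself.

```python
from fractions import Fraction

def outcome(rfire, first_lost):
    """Classify one execution of protocol A (valid input assumed).

    first_lost is the round of the first destroyed packet.
    """
    if first_lost < rfire:
        return "NA"   # a packet before round rfire was lost: nobody attacks
    if first_lost == rfire:
        return "PA"   # one process already decided, the other never hears
    return "TA"       # packets through round rfire all arrived: both attack

def outcome_probabilities(n_rounds, first_lost=None):
    """Exact outcome distribution over rfire uniform in {2, ..., n_rounds}."""
    if first_lost is None:            # no packet is ever destroyed
        first_lost = n_rounds + 1
    choices = range(2, n_rounds + 1)
    probs = {"NA": Fraction(0), "PA": Fraction(0), "TA": Fraction(0)}
    for rfire in choices:
        probs[outcome(rfire, first_lost)] += Fraction(1, len(choices))
    return probs
```

Under these assumptions the adversary's best choice of `first_lost` yields partial attack with probability exactly 1/(N-1), which is the "approximately 1/N" of the text.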

In addition, process 2 includes a bit that encodes its
input in the packets it sends. Suppose at the end of
Round 1, process 1 has not received a signal to attack
and has not received a packet from process 2 saying
that process 2 has received a signal to attack. Then
process 1 does not send a packet in Round 2, and
the protocol stops. Thus protocol A satisfies validity.
Finally, let $R_g$ be a “good” run in which all messages
are delivered and the input is valid. Then on run $R_g$,
both processes will always decide to attack. Hence
$\mathcal{L}(A, R_g)$, the liveness of A on run $R_g$, is 1. However,
this simple protocol raises two questions:

•  $U_s(A) \approx 1/N$ and $\mathcal{L}(A, R_g) = 1$. Can we decrease
$U_s(A)$ further while keeping $\mathcal{L}(A, R_g)$ unchanged?
In other words, can we find a protocol
a) whose probability of making a mistake is better
than 1/N, and b) whose probability of attacking
on a good run is 1? It might seem that this
can be done by running A several times. However,
the answer is no, as we show in Section 5.

•  Consider a run R in which the input is valid
and all messages are delivered except the message
sent by process 1 in Round 2. It is easy to
see that $\mathcal{L}(A, R) = 0$. Intuitively, this is not satisfactory
because in run R, all but one message
is delivered, and yet the probability of attacking
on run R is 0. Can we design a protocol whose
liveness grows in some fashion with the number
of messages delivered in a run? We will describe
an “optimal” protocol S in Section 6.

4  Information  Flow,  Clipping, 
and  Information  Level 

In this section, we describe three concepts that underlie
both the lower bounds of Section 5 and the
protocol in Section 6. We begin with a definition
that captures the usual idea of information flow or
possible causality [L] between process-round pairs in
a run.


Consider any $i, k \in V \cup \{v_0\}$ and any $r, s \in
\{-1, 0, \ldots, N\}$. We say that (i, r) directly flows to
(k, s) in run R iff $s = r + 1$ and either $i = k$ or
$(i, k, s) \in R$. We define the flows to relation between
process-round pairs as the reflexive transitive closure
of the directly flows-to relation. Thus:

Lemma 4.1 If (i, r) flows to (j, s) and (j, s) flows to
(k, t) in run R, then (i, r) flows to (k, t) in run R.
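The flows-to relation can be computed by a forward sweep over rounds. The sketch below is illustrative: it assumes a run is a set of (sender, receiver, round) tuples with 0 standing for $v_0$, and `flows_to` is a hypothetical helper name.

```python
V0 = 0  # assumed label for the environment node v0

def flows_to(run, src, dst):
    """Return True iff process-round pair src = (i, r) flows to dst = (k, s).

    Implements the reflexive transitive closure of "directly flows to":
    one step goes from round t-1 to round t, either by staying at the same
    process or along a message (a, b, t) that is delivered in the run.
    """
    (i, r), (k, s) = src, dst
    if s < r:
        return False
    reach = {i}                       # processes reached at the current round
    for t in range(r + 1, s + 1):
        # The comprehension is evaluated on the old set before the union,
        # so each iteration advances by exactly one round.
        reach |= {b for (a, b, rnd) in run if rnd == t and a in reach}
    return k in reach

example_run = {(V0, 3, 0), (1, 2, 6), (3, 2, 7)}
```

On the example run of Section 2, the input reaches process 2 only through the Round 7 message from process 3.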

We introduce a measure of the “knowledge” [HM]
a process has in a run. We first define information
“height” and use it to define the more useful idea
of information “level”. Intuitively, a process reaches
height 1 when it hears the input. A process reaches
height $h > 1$ when it has heard that all other processes
have reached height $h - 1$. More formally, we
say that j can reach height h by round r in run R
iff h is a nonnegative integer subject to the following
conditions:

•  If $h = 0$, there are no conditions.

•  If $h = 1$, $(v_0, -1)$ flows to (j, r) in R.

•  If $h > 1$, then for all $i \ne j \in V$, there is some
$r_i$ such that $(i, r_i)$ flows to (j, r) in R and i can
reach height $h - 1$ by round $r_i$ in R.

Next, we define $L_j^r(R)$, the level j reaches by round
r of run R, to be the maximum height j can reach by
round r. We use $L_j(R)$ to denote $L_j^N(R)$^4 and $L(R)$
to denote $\min_{j \in V}(L_j(R))$.
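The height conditions translate directly into a recursion on rounds. The following sketch is an illustration under stated assumptions: a run is a set of (sender, receiver, round) tuples, 0 stands for $v_0$, there are at least two processes, and `flows_to` and `level` are hypothetical helper names.

```python
from functools import lru_cache

V0 = 0  # assumed label for the environment node v0

def flows_to(run, src, dst):
    """True iff pair src = (i, r) flows to pair dst = (k, s) in the run."""
    (i, r), (k, s) = src, dst
    if s < r:
        return False
    reach = {i}
    for t in range(r + 1, s + 1):
        reach |= {b for (a, b, rnd) in run if rnd == t and a in reach}
    return k in reach

def level(run, procs, j, r):
    """The maximum height process j can reach by round r (needs len(procs) >= 2)."""
    @lru_cache(maxsize=None)
    def lvl(j, r):
        if not flows_to(run, (V0, -1), (j, r)):
            return 0                  # input never reached j: only height 0
        # Height h > 1 needs every other process i to have reached h - 1 at
        # some earlier pair (i, t) that flows to (j, r).
        best_other = [
            max((lvl(i, t) for t in range(-1, r)
                 if flows_to(run, (i, t), (j, r))), default=0)
            for i in procs if i != j
        ]
        return 1 + min(best_other)
    return lvl(j, r)

# Two processes, both with input, three alternating messages all delivered.
demo_run = frozenset({(V0, 1, 0), (V0, 2, 0), (2, 1, 1), (1, 2, 2), (2, 1, 3)})
```

With N = 3 on `demo_run`, process 1 ends at level 4 and process 2 at level 3, so the minimum over processes is 3.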

Finally, we introduce a construction to “clip” a run
with respect to a process i such that the constructed
run preserves all information flow to i. This construction
is the key to the lower bound proof. We define
$\mathrm{Clip}_i(R) = \{(j, k, r) \in R : (k, r)$ flows to $(i, N)$ in
run $R\}$. It is not hard to see that clipping with respect
to i preserves any information that i can gather in the
run. Hence we have:

Lemma 4.2 Let $\mathrm{Clip}_i(R) = \tilde{R}$. Then $L_i(\tilde{R}) = L_i(R)$
and $R \stackrel{i}{=} \tilde{R}$.
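Clipping is a one-line filter once flows-to is available. The sketch below is illustrative only, with the same assumed encoding as before (a run as a set of (sender, receiver, round) tuples, 0 for $v_0$) and hypothetical helper names `flows_to` and `clip`.

```python
V0 = 0  # assumed label for the environment node v0

def flows_to(run, src, dst):
    """True iff pair src = (i, r) flows to pair dst = (k, s) in the run."""
    (i, r), (k, s) = src, dst
    if s < r:
        return False
    reach = {i}
    for t in range(r + 1, s + 1):
        reach |= {b for (a, b, rnd) in run if rnd == t and a in reach}
    return k in reach

def clip(run, i, n_rounds):
    """Keep the tuples (j, k, r) whose receiving pair (k, r) flows to (i, N)."""
    return {(j, k, r) for (j, k, r) in run
            if flows_to(run, (k, r), (i, n_rounds))}

demo_run = {(V0, 1, 0), (V0, 2, 0), (2, 1, 1), (1, 2, 2), (2, 1, 3)}
```

For `demo_run` with N = 3, clipping with respect to process 2 drops only the Round 3 message to process 1, since nothing process 1 learns in Round 3 can still reach process 2.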

5  Lower Bound for Strong Adversary

The  first  lemma  captures  the  intuitive  idea  that  a 
change  in  level  can  only  come  about  by  receiving  a 
message. 

^4 Recall that N is the maximum round number.
Lemma 5.1 For any run R and any $k \in V$, if $L_k(R) =
l > 0$ then there must be some tuple $(j, k, r) \in R$ such
that $L_k^r(R) = l$.

Proof: From the definition of level, we see that if
there is no j such that $(j, k, s) \in R$ then $L_k^s(R) =
L_k^{s-1}(R)$. Thus if $L_k^N(R) = l$ we can work backwards
from round number N until we find the r required for
the lemma. If we fail then there is no (*, k, *) tuple
in R, which would imply that $l = 0$, a contradiction.
Thus we cannot fail. □

The next lemma describes the key property of
clipped runs and information levels that we use to
prove our lower bound. It says that if i reaches information
level l at the end of run R then at the end of
$\mathrm{Clip}_i(R)$ there must be some process k whose information
level is no more than $l - 1$. In essence, this is
why i cannot go to a higher information level than l
by the end of R.

Lemma 5.2 Consider a run R such that $L_i(R) = l >
0$ and $\mathrm{Clip}_i(R) = \tilde{R}$. Then there is some $k \in V$ such
that $L_k(\tilde{R}) \le l - 1$.

Proof: By contradiction. Thus for all $k \in V$, we
assume that $L_k(\tilde{R}) \ge l$.

Consider any $k \ne i$. By Lemma 5.1 and the fact
that $l > 0$, there must be some tuple $(j, k, r) \in \tilde{R}$ such
that $L_k^r(\tilde{R}) \ge l$. Since $(j, k, r) \in \tilde{R}$ then (by definition
of clipping), (k, r) flows to (i, N) in $R$. Hence, we can
show that (k, r) flows to (i, N) in $\tilde{R}$. We also know
that $L_k^r(\tilde{R}) \ge l$. Since this is true for all $k \ne i$ we
must have (see the definition of level) $L_i(\tilde{R}) \ge l + 1$.
But by Lemma 4.2, this implies that $L_i(R) \ge l + 1$,
a contradiction. □

Lemma 5.3 For all protocols F, all runs R, and any
process index $i \in V$, $\Pr[D_i \mid R] \le U_s(F) L_i(R)$.

Proof: By induction on l in the following inductive
hypothesis.

Inductive hypothesis: For all i and all runs R
with $L_i(R) = l$, $\Pr[D_i \mid R] \le U_s(F) l$.

Base case, $l = 0$: Thus $L_i(R) = 0$. Let $\tilde{R} =
\mathrm{Clip}_i(R)$. We first claim that $I(\tilde{R}) = \{\}$. Suppose
not for contradiction. Then there is some j such that
$(v_0, j, 0) \in \tilde{R}$; hence, since $\tilde{R} \subseteq R$, $(v_0, j, 0) \in R$.
Also by the definition of clipping, (j, 0) flows to (i, N)
in R. But in that case, $L_i(R) \ge 1$, a contradiction.
Thus we must have $I(\tilde{R}) = \{\}$. Also by
Lemma 4.2, $R \stackrel{i}{=} \tilde{R}$. Hence $\Pr[D_i \mid R] = \Pr[D_i \mid \tilde{R}] =
0$, by Lemma 2.1 and the validity requirement. Thus
$\Pr[D_i \mid R] = U_s(F) L_i(R)$.


Inductive Step, $l > 0$: Consider any l and R
such that $L_i(R) = l$. Let $\tilde{R} = \mathrm{Clip}_i(R)$. By
Lemma 5.2, there exists some k such that $L_k(\tilde{R}) \le
L_i(R) - 1$. Hence, by the inductive hypothesis,
$\Pr[D_k \mid \tilde{R}] \le U_s(F)(l - 1)$. But by our bound on unsafety,
Lemma 2.2, $\Pr[D_i \mid \tilde{R}] - \Pr[D_k \mid \tilde{R}] \le U_s(F)$.
Hence $\Pr[D_i \mid \tilde{R}] \le U_s(F) l$. But by the fact that $R$
and $\tilde{R}$ are indistinguishable to i and by Lemma 2.1,
it follows that $\Pr[D_i \mid R] \le U_s(F) l$. □

Theorem 5.4 For any F, $\mathcal{L}(F, R) \le U_s(F) L(R) \le
\epsilon L(R)$.

From Lemma 5.3, for any $i \in V$, $\Pr[D_i \mid R] \le
U_s(F) L_i(R)$. Thus from Lemma 2.3, $\mathcal{L}(F, R) \le
U_s(F) L_i(R)$ for any $i \in V$. Thus from the definition
of $L(R)$, $\mathcal{L}(F, R) \le U_s(F) L(R)$. The theorem
now follows from the agreement condition. □
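To make the bound concrete, here is a worked instance with illustrative numbers that are assumptions for this example, not values from the paper. Suppose a protocol satisfies agreement with $\epsilon = 0.01$ and a run R has $L(R) = 5$:

```latex
\[
\mathcal{L}(F, R) \;\le\; U_s(F)\, L(R) \;\le\; \epsilon\, L(R)
  \;=\; 0.01 \times 5 \;=\; 0.05 .
\]
```

So on such a run no protocol with unsafety at most 1% can have all generals attack with probability above 5%; raising the liveness requires either a run with a higher level or a larger tolerated unsafety.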

6  Optimal  Protocol  Against  a 
Strong  Adversary 

In Protocol S which we describe below, we will arbitrarily
designate process 1 to choose a random number
rfire. In order to attack, we will require that any
other process i hear the value of rfire from process
1 in addition to hearing the input. This motivates a
second measure on a run R that we call the modified
level measure. It is defined in a parallel fashion to
the original level measure by first defining a modified
height or m-height. Formally, we say that process j
can reach m-height h by round r in run R iff h is
a nonnegative integer subject to the following conditions:

•  If $h = 0$, there are no conditions.

•  If $h = 1$, $(v_0, -1)$ and (1, 0) flow to (j, r) in R.

•  If $h > 1$, then for all $i \ne j \in V$, there is some
$r_i$ such that $(i, r_i)$ flows to (j, r) in R and i can
reach m-height $h - 1$ by round $r_i$ in R.

Thus the only difference between the m-height and
height definitions is in the condition required to reach
m-height 1. In the case of m-height we not only
require that j has heard the input but also that j
has heard from process 1. We also define $ML_j^r(R)$,
$ML_j(R)$, $ML(R)$ analogously to the previous definitions
for L.

Because  of  the  small  difference  in  the  definitions, 
it  is  easy  to  show  that  the  modified  level  measure 
differs  by  at  most  one  from  the  level  measure.  Also 
the  modified  level  measured  by  any  two  processes  can 
differ  by  at  most  one. 
Lemma 6.1 For all R and $i \in V$, $L_i(R) - 1 \le
ML_i(R) \le L_i(R)$.

Lemma 6.2 For all R and $i, j \in V$, $ML_j(R) \ge
ML_i(R) - 1$.

We will design a protocol based closely on the lower
bound arguments of the previous section. Recall that
we had shown that for any F, $\mathcal{L}(F, R) \le \epsilon L(R)$. We
have also seen that the modified level measure differs
by at most one from the level measure. Thus in order
to come close to meeting the lower bound, we will
design a protocol in which:

•  Each process i will calculate $ML_i(R)$, the value
of the modified level at the end of the current
run R.

•  Each process will decide to attack with a probability
proportional to $ML_i(R)$. This causes the
liveness of the protocol to grow with $ML_i(R)$.

To do so each process i in protocol S has a variable
count_i that counts the value of $ML_i(R)$. We say
that i has begun counting if count_i > 0. We will see
how i begins counting below. However, once i has
begun counting, process i increases count_i to s (for
s > 1) when it has heard that all other processes
have reached a count of s - 1. It is easy to implement
this if each message sent by a node i carries count_i
and a variable called seen_i, the set of nodes that i
knows have reached count_i.

Protocol S must satisfy agreement with parameter
$\epsilon$. Let $t = 1/\epsilon$. Process 1 chooses a random number
rfire uniformly distributed in the range (0, t] and
passes it on all messages. After N rounds, i decides
to attack if i has heard the value of rfire from process
1 and count_i $\ge$ rfire.

Process i starts counting (i.e., sets count_i to 1) in
round r as soon as it finds out that $(v_0, -1)$ and (1, 0)
flow to (i, r). We have discussed the reason for the
second condition. The first condition, of course, is
imposed to ensure validity. To implement the first
condition, we use a variable valid_i at each process
i that is set to true in the first round r such that
$(v_0, -1)$ flows to (i, r). To implement the second condition,
all processes other than process 1 initially set
the value of rfire_i to a special value undefined which
is updated when a message is received with the value
of rfire.

6.1  Protocol  Code 

Protocol S consists of local state machines, each of
which has a set of states, an initial state, a state
transition function, a message generation function,
and an output decision function. We describe each
component in turn:

Each  process  i  has  the  following  state  variables: 

•  count_i: integer between 1 and N (counts the
value of $ML_i(R)$ in the current run R).

•  rfire_i: either a default value of undefined or a real
number in the range $(0, 1/\epsilon]$. We assume that the
value of undefined is not in $(0, 1/\epsilon]$.

•  seen_i: a subset of V (represents the processes
that have reached count_i that i knows about).

•  valid_i: a boolean (that is true if i has heard from
$v_0$).

We also use three temporary variables at each process:
highcount_i (an integer), highseen_i (a subset of
V), and highset_i (a set of messages, whose format we
describe later).

The initial states are as follows. Process 1 initially
sets rfire_1 to a random number uniformly distributed
in the range $(0, 1/\epsilon]$. All processes i other
than 1 set rfire_i = undefined. The valid_i bit is only
set if process i has received an input message from
$v_0$ in Round 0. Finally process 1 sets count_1 = 1
iff valid_1 = 1. All other processes i initially set
count_i = 0.

A message is denoted by m and has fields m(rfire),
m(count), m(seen), and m(valid). The message generation
function for i in every round sends a message
m(rfire, count, seen, valid) to all neighbors with
m(rfire) = rfire_i, m(count) = count_i, m(seen) =
seen_i, m(valid) = valid_i. Thus i sends a message
with its current state to all neighbors in every round.

At the end of a round r, for $1 \le r \le N$, process i executes
the procedure PROCESS-MESSAGE($S_i^r$, i) where
$S_i^r$ is the set of messages process i has received in
round r. PROCESS-MESSAGE($S_i$, i) is shown in Figure 1.
The first four lines are used to decide when
a process starts counting; the remainder of the code
does the actual counting.

Finally, at the end of $N$ rounds, $O_i = 1$ iff
$rfire_i \ne undefined$ and $count_i \ge rfire_i$.
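The state variables, the message format, the procedure of Figure 1 and the output decision rule can be sketched together in Python. This is only an illustrative reading of the description above, not the authors' code: the class names, the use of `None` for undefined, and the choice to start $seen_i$ as $\{i\}$ (the text does not give its initial value) are assumptions.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Message:
    rfire: Optional[float]  # None plays the role of "undefined"
    count: int
    seen: FrozenSet[int]
    valid: bool


@dataclass
class ProcessState:
    pid: int
    rfire: Optional[float] = None
    count: int = 0
    valid: bool = False
    seen: Set[int] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Assumption: seen_i starts as {i}; the text leaves this unspecified.
        self.seen.add(self.pid)

    def message(self) -> Message:
        # Message generation: send a snapshot of the current state.
        return Message(self.rfire, self.count, frozenset(self.seen), self.valid)


def process_message(state: ProcessState, received: Iterable[Message],
                    all_processes: Set[int]) -> None:
    """One end-of-round update, following the structure of Figure 1."""
    msgs = list(received)
    # Decide when to start counting.
    if state.rfire is None:
        for m in msgs:
            if m.rfire is not None:
                state.rfire = m.rfire
                break
    if not state.valid and any(m.valid for m in msgs):
        state.valid = True
    if state.valid and state.rfire is not None and state.count == 0:
        state.count = 1
    # The actual counting.
    if state.count >= 1 and msgs:
        highcount = max(m.count for m in msgs)
        highseen = set().union(*(m.seen for m in msgs if m.count == highcount))
        if highcount == state.count:
            state.seen |= highseen | {state.pid}
        elif highcount > state.count:
            state.seen = highseen | {state.pid}
            state.count = highcount
        if state.seen == all_processes:
            state.count += 1
            state.seen = {state.pid}


def decide(state: ProcessState) -> int:
    """Output decision: attack iff rfire is defined and count >= rfire."""
    return int(state.rfire is not None and state.count >= state.rfire)
```

In a two-process run where every message is delivered, process 1 starts with $count_1 = 1$ and the two counts then leapfrog each other by one per round, which matches the intuition that modified levels at different processes differ by at most 1.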

6.2  Proof  of  Properties  of  Protocol  5 

Notation: Consider any execution $Ex(R, \alpha)$. Let $v^r$ denote the
value of a variable $v$ at the end of round $r$. For example,
$count_i^r$ denotes the value of $count_i$ at the end of $r$ rounds.
Define $rfire$ to be the value of $rfire_1$ in the initial state.

PROCESS-MESSAGE(S_i, i)

    If (rfire_i = undefined) and (∃ m ∈ S_i : m(rfire) ≠ undefined)
        then rfire_i := m(rfire)
    If (valid_i = false) and (∃ m ∈ S_i : m(valid) = true)
        then valid_i := true
    If (valid_i = true) and (rfire_i ≠ undefined) and (count_i = 0)
        then count_i := 1
    If (count_i ≥ 1) and (S_i ≠ ∅) then
        highcount := Max_{m ∈ S_i} m(count)
        highset := {m ∈ S_i : m(count) = highcount}
        highseen := ∪_{m ∈ highset} m(seen)
        If highcount = count_i then
            seen_i := seen_i ∪ highseen ∪ {i}
        Else
            If highcount > count_i then
                seen_i := highseen ∪ {i};
                count_i := highcount
        If seen_i = V then
            count_i := count_i + 1;
            seen_i := {i};

Figure 1: Procedure executed by process i at the end of a round in
Protocol S

Our first major step will be to establish that
$count_i^r = ML_i^r(R)$. To allow a careful inductive proof, we will
introduce invariants. The invariants should be intuitively clear from
the previous discussion. The proofs of these invariants are deferred
to the final paper.

Lemma 6.3 For any execution $Ex(R, \alpha)$ of Protocol S, the
following assertions are true for $0 \le r \le N$ and for all
$i, j \in V$:

1. $rfire_i^r$ is either equal to $rfire$ or undefined.

2. $count_i^r \ge 1$ iff $rfire_i^r = rfire$ and $valid_i^r = true$.

3. $(1, 0)$ flows to $(i, r)$ iff $rfire_i^r = rfire$.

4. $(v_0, -1)$ flows to $(i, r)$ iff $valid_i^r = true$.

5. If $(j, s)$ flows to $(i, r)$ in $R$ then either
($count_i^r > count_j^s$) or ($j \in seen_i^r$ and
$count_i^r = count_j^s$) or ($count_i^r = count_j^s = 0$).

6. If ($j \in seen_i^r$) then there is some $s$ such that
($count_j^s = count_i^r$) and $(j, s)$ flows to $(i, r)$.

7. $seen_i^r \ne V$ and $seen_i^r$ […] $\{i\}$. Also, if
$count_i^r \ge 1$ then $i \in seen_i^r$.

8. $ML_i^r \ge count_i^r$.


These invariants can now be used to establish that each process counts
a value equal to its modified level measure. This should not be hard
to believe since the code follows the definition of modified level.

Lemma 6.4 For all $i \in V$, any $r$ such that $0 \le r \le N$, and
any execution $Ex(R, \alpha)$ of Protocol S:
$count_i^r = ML_i^r(R)$.

Proof: From the last invariant in Lemma 6.3, we see that
$count_i^r \le ML_i^r(R)$. So we show that
$count_i^r \ge ML_i^r(R)$. We do so by induction on the value of
$ML_i^r(R)$.

First, if $ML_i^r(R) = 0$ we are done trivially since $count_i^r$ is
always nonnegative. We use $ML_i^r(R) = 1$ as the base case. Then from
the definition of $ML_i^r(R)$, we know that $(v_0, -1)$ and $(1, 0)$
flow to $(i, r)$ in run $R$. Hence by the third and fourth invariants
in Lemma 6.3, $rfire_i^r = rfire$ and $valid_i^r = true$. Hence by the
second invariant in Lemma 6.3, $count_i^r \ge 1$.

Next, suppose $ML_i^r(R) = l > 1$. Then from the definition of
$ML_i^r(R)$, we know that for all $j \ne i$ there exists $r_j < r$
such that $(j, r_j)$ flows to $(i, r)$ in run $R$ and
$ML_j^{r_j} = l - 1$. Hence by the fifth invariant in Lemma 6.3 and
the inductive hypothesis, either $count_i^r > l - 1$ (in which case we
are done) or for all $j \ne i$, $j \in seen_i^r$. But the second case
contradicts the seventh invariant in Lemma 6.3, and so we are done.
□

Next  we  sketch  proofs  of  the  validity,  unsafety,  and 
liveness  properties  of  S. 

Theorem 6.5 Protocol S satisfies validity.

Proof: Informally, in any execution in which no process receives an
input signal, no process hears from $v_0$, and so $count_i = 0$ for
all $i$. Thus by the output decision function, $O_i = 0$ for all $i$
in this execution.

More formally, fix a run $R$ such that $I(R) = \{\}$, a random vector
$\alpha$, and any process $i$. Consider the execution
$Ex(R, \alpha)$. Thus $(v_0, -1)$ does not flow to $(i, N)$ for any
$i \in V$. Thus by Invariant 4 in Lemma 6.3, $valid_i^N = false$.
Hence by Lemma 6.3, Invariant 2, $count_i^N < 1$. Hence
$count_i^N = 0$. However, $rfire_i^N$, by Invariant 1, Lemma 6.3, is
either equal to $rfire$ (which is strictly greater than 0) or
undefined. In either case, by the output decision function, $O_i = 0$
in $Ex(R, \alpha)$. □

To prove the unsafety and liveness properties of S we characterize
when the total attack and no attack events occur. Let $Mincount$ be
the minimum across all processes $i$ of the value of $count_i$ at the
end of an execution. The next lemma states that all processes will
attack if $Mincount$ is no less than $rfire$, and no process will
attack if $Mincount$ is strictly less than $rfire - 1$:

Lemma 6.6 Fix an execution $E$ of Protocol S. If
$Mincount \ge rfire$ then $E \in TA$; but if $Mincount < rfire - 1$
then $E \in NA$.

Proof: If $Mincount \ge rfire$ then for all processes $i$,
$count_i^N \ge rfire$. But $rfire > 0$, hence for all $i$,
$count_i^N \ge 1$. Hence (by Lemma 6.3, Invariant 2), for all $i$,
$rfire_i^N = rfire$. Hence for all $i$, $count_i^N \ge rfire_i^N$ and
$rfire_i^N \ne undefined$. Hence for all $i$ (by the decision
function), $O_i = 1$. Hence, $E \in TA$.

If $Mincount < rfire - 1$, then using Lemma 6.4 and using the fact
that the modified level measured at any two processes differs by at
most 1 (Lemma 6.2), for all $i$, $count_i^N < rfire$. Now (by Lemma
6.3, Invariant 1), either $rfire_i^N = rfire$ or
$rfire_i^N = undefined$. Hence, for all $i \in V$, either
$count_i^N < rfire_i^N$ or $rfire_i^N = undefined$. Thus by the
definition of the output decision function $O_i = 0$ for all $i$.
Hence $E \in NA$. □

Theorem 6.7 S satisfies agreement with parameter $\varepsilon$.

Proof: By definition $U_s(S)$ is the maximum across all runs $R$ of
$\Pr[PA|R]$. Consider any execution $E = Ex(R, \alpha)$. Now partial
attack $PA$ is the complement of the no attack and total attack
events, $NA$ and $TA$. From Lemma 6.6, we know that either $TA$ or
$NA$ will occur unless $Mincount < rfire \le Mincount + 1$. Hence
$\Pr[PA|R] \le \Pr[Mincount < rfire \le Mincount + 1 \mid R]$. Now for
a given $R$, $Mincount$ is fixed while $rfire$ is a uniformly
distributed random number in the range $(0, 1/\varepsilon]$, so
$rfire$ falls in a fixed interval of length 1 with probability at most
$\varepsilon$. Thus $U_s(S) \le \varepsilon$. □

Theorem 6.8 $\mathcal{L}(S, R) \ge \min(1, \varepsilon ML(R))$.

Proof: Recall the definition of $\mathcal{L}(S, R)$ as the
probability of total attack, $\Pr[TA|R]$. We find a lower bound on
$\Pr[TA|R]$. Consider any execution $E$. From Lemma 6.6, $E \in TA$ if
$Mincount \ge rfire$. But by Lemma 6.4 and the definition of
$Mincount$, $Mincount = ML(R)$. Hence, $E \in TA$ if
$ML(R) \ge rfire$. Thus for any run $R$, $\Pr[TA|R]$ is no less than
$\Pr[ML(R) \ge rfire \mid R]$. Now for a given $R$, $ML(R)$ is fixed
while $rfire$ is a uniformly distributed random number in the range
$(0, 1/\varepsilon]$. Thus $\Pr[TA|R]$ is no less than
$\min(1, \varepsilon ML(R))$. □
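The two probability calculations in the proofs of Theorems 6.7 and 6.8 both reduce to measuring an interval against the uniform distribution of $rfire$ on $(0, 1/\varepsilon]$. The sketch below makes that arithmetic explicit; the function names are hypothetical and chosen only for illustration.

```python
def liveness_lower_bound(eps: float, ml: float) -> float:
    """Pr[rfire <= ML(R)] for rfire uniform on (0, 1/eps]: min(1, eps * ML(R))."""
    return min(1.0, eps * ml)


def partial_attack_upper_bound(eps: float, mincount: float) -> float:
    """Pr[Mincount < rfire <= Mincount + 1] for rfire uniform on (0, 1/eps].

    The window has length 1, possibly cut off at the right end of the
    range, and the uniform density is eps, so the result never exceeds eps.
    """
    lo = max(0.0, mincount)
    hi = min(1.0 / eps, mincount + 1)
    return max(0.0, hi - lo) * eps
```

For example, with $\varepsilon = 0.1$ a run with $ML(R) = 4$ has liveness at least 0.4, while the chance of partial attack stays at most 0.1 for any value of $Mincount$.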


7  Closing  the  Gap:  A  Second 
Lower  Bound 

Theorem 5.4 states that for every run $R$ and every protocol $F$, the
liveness $\mathcal{L}(F, R)$ of any protocol $F$ is at most
$\min(1, \varepsilon L(R))$. We described a protocol S whose liveness
is $\min(1, \varepsilon ML(R))$. From Lemma 6.1, we know that $ML(R)$
differs from $L(R)$ by at most one. Thus we have a small but
irritating gap of $\varepsilon$. Our second lower bound shows, under a
reasonable set of conditions that we call the usual case assumption,
that no protocol $F$ can do better than $\varepsilon ML(R)$ on all
runs $R$. More precisely, if any protocol $F$ has a run $R$ such that
$\mathcal{L}(F, R) > \varepsilon ML(R)$ then there is some other run
$\tilde{R}$ such that
$\mathcal{L}(F, \tilde{R}) < \varepsilon ML(\tilde{R})$. Thus together
the two bounds show that Protocol S is indeed “optimal”.

A precise description of the second lower bound is in the appendix.
We note that the proof of the first lower bound is similar to the
chain arguments used often in deterministic impossibility results
(e.g., [FL]). However, in proving the second lower bound, we are led
to some connections between causality, probabilistic independence, and
probabilistic agreement that may be interesting in their own right.

8  Conclusions 

A strong adversary can be used to model a situation where links can
crash and restart at an arbitrary frequency. A solution to coordinated
attack is important in situations where consensus must be reached
across unreliable links and within a specified time constraint. For
coordinated attack against a strong adversary, we have seen that no
protocol can achieve a tradeoff between liveness and safety
($\mathcal{L}/U$) that is better than linear in the number of rounds.
This is bad news. For example, if we want to achieve liveness with
probability 1 on some run, and yet limit the probability of error to
be less than 0.001, then the protocol must run for at least 1000
rounds. Protocol S demonstrates that the lower bounds are tight, but
its performance is far from adequate. While our results are stated in
a synchronous model, it seems clear that they can be extended to an
asynchronous model.

In practice, there are two approaches that may help us to overcome
these limitations. One approach is to add redundant links and assume
that failures can only affect some fraction of the links in the
network; then solutions similar to Byzantine Agreement can be used.
However, this approach is expensive. The other approach is to assume a
weaker failure model than a strong adversary. One such adversary,
which we call a weak adversary, is a probabilistic adversary which can
destroy messages with a probability $p$ that is not known in advance.
We have preliminary results that show vastly improved performance
against such an adversary.

A Lower Bound Based on Independence

Our second lower bound needs the following assumption. We say that the
usual case assumption holds if:

• The graph $G$ is connected and the diameter of $G$ is no more than
the number of rounds $N$.

• $\varepsilon < 0.5$.

It is easy to see that these two conditions capture the usual and
interesting cases. If the first condition does not hold then it can be
shown that $L_i(R) \le 1$ for all $i, R, F$ and so by Lemma 5.4,
$\mathcal{L}(F, R) \le \varepsilon$. Similarly, if the second
condition does not hold, the protocol is allowed to fail more than
half the time. Thus the conditions preclude parameter settings that
force absurdly small values of liveness and allow absurdly large
values of unsafety.

Theorem A.1 Under the usual case assumption, if any protocol $F$ has a
run $R$ such that $\mathcal{L}(F, R) > \varepsilon ML(R)$ then there
is some other run $\tilde{R}$ such that
$\mathcal{L}(F, \tilde{R}) < \varepsilon ML(\tilde{R})$.

The proof exploits a simple connection between probabilistic
independence and what we call causal independence. Intuitively, two
processes are causally independent if there is no causal flow,
possibly through another process, that can link the two processes. For
any $i, j \in V$, we say that $i$ and $j$ are causally independent in
run $R$ if there is no $k \in V$ such that $(k, 0)$ flows to $(i, N)$
and $(k, 0)$ flows to $(j, N)$ in $R$. The connection is expressed by
the intuitive lemma:

Lemma A.2 If $i$ and $j$ are causally independent in run $R$ then the
events $(D_i|R)$ and $(D_j|R)$ are independent events.
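The definition of causal independence can be checked mechanically once a run is written down. The sketch below assumes, as the tuples $(i, j, r)$ used in this appendix suggest, that a run is a set of (sender, receiver, round) triples and that "$(k, 0)$ flows to $(i, N)$" means there is a chain of delivered messages in increasing rounds from $k$ to $i$; the precise definition is given earlier in the paper, so this reading and the function names are assumptions.

```python
from typing import Hashable, Iterable, Set, Tuple

Delivery = Tuple[Hashable, Hashable, int]  # (sender, receiver, round)


def reachable_from(run: Iterable[Delivery], source: Hashable,
                   n_rounds: int) -> Set[Hashable]:
    """Processes i such that (source, 0) flows to (i, n_rounds)."""
    deliveries = list(run)
    reached = {source}  # a process always flows to its own later states
    for r in range(1, n_rounds + 1):
        # A round-r message carries the sender's knowledge from round r - 1.
        newly = {j for (i, j, rr) in deliveries if rr == r and i in reached}
        reached |= newly
    return reached


def causally_independent(run: Iterable[Delivery], i: Hashable, j: Hashable,
                         processes: Iterable[Hashable], n_rounds: int) -> bool:
    """True if no process k has (k, 0) flowing to both (i, N) and (j, N)."""
    deliveries = list(run)
    for k in processes:
        reached = reachable_from(deliveries, k, n_rounds)
        if i in reached and j in reached:
            return False
    return True
```

Note that $k$ may be $i$ or $j$ itself, so any direct or indirect delivery from $i$ to $j$ already rules out causal independence.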

If $i$ and $j$ are causally independent in run $R$, then there must be
some restrictions on their decision probabilities in $R$ in order to
preserve the agreement property. There are several ways in which these
restrictions can be phrased; we select one that is sufficient for the
later development.

Lemma A.3 Consider a run $R$ in which $i$ and $j$ are causally
independent and such that $\Pr[D_i|R] = \varepsilon$. Then if
$\varepsilon < 0.5$, $\Pr[D_j|R] = 0$.

Proof: Let $\Pr[D_j|R] = \delta$. We know that
$\Pr[PA|R] \ge \Pr[D_i \bar{D}_j | R] + \Pr[D_j \bar{D}_i | R]$. But
since $i$ and $j$ are causally independent in $R$ we have by Lemma A.2
that the events $(D_i|R)$ and $(D_j|R)$ are independent. Hence,
$\Pr[PA|R] \ge \varepsilon(1 - \delta) + \delta(1 - \varepsilon)$ and
so $\Pr[PA|R] \ge \varepsilon + \delta(1 - 2\varepsilon)$. But since
$\varepsilon < 0.5$, $1 - 2\varepsilon > 0$. Agreement requires
$\Pr[PA|R] \le \varepsilon$, hence $\delta = 0$. □
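For readers who want the algebra of this proof spelled out, the following worked derivation restates it step by step. It uses only the quantities named in the proof ($\varepsilon = \Pr[D_i|R]$, $\delta = \Pr[D_j|R]$) and the agreement requirement $\Pr[PA|R] \le \varepsilon$.

```latex
\begin{align*}
\Pr[PA \mid R]
  &\ge \Pr[D_i \bar{D}_j \mid R] + \Pr[\bar{D}_i D_j \mid R] \\
  &= \varepsilon (1 - \delta) + (1 - \varepsilon)\,\delta
     && \text{(independence, Lemma A.2)} \\
  &= \varepsilon + \delta (1 - 2\varepsilon). \\
\intertext{Agreement with parameter $\varepsilon$ gives
$\Pr[PA \mid R] \le \varepsilon$, so}
  \delta (1 - 2\varepsilon) &\le 0 .
\end{align*}
% Since 1 - 2\varepsilon > 0 and \delta \ge 0, this forces \delta = 0.
```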

For the next lemma, recall the definition of the modified level of
process $i$ in run $R$. This lemma serves to set up the proof of the
following lemma, Lemma A.5.

Lemma A.4 Suppose that for all runs $R$ and for all $i \in V$,
$\Pr[D_i|R] = 0$ if $ML_i(R) = 0$. Then for all $R$ and $i \in V$,
$\Pr[D_i|R] \le ML_i(R)\,\varepsilon$.

Proof: By induction on the value of $ML_i(R)$. Let $ML_i(R) = l$.

Base Case, $l = 0$: This is the assumption of the lemma.

Inductive Step, $l > 0$: Using a lemma similar to Lemma 5.2, we can
show that if $\tilde{R} = Clip_i(R)$, then there is some $k$ such that
$ML_k(\tilde{R}) = l - 1$. Hence by inductive assumption,
$\Pr[D_k|\tilde{R}] \le \varepsilon(l - 1)$. Hence by Lemma 2.2,
$\Pr[D_i|\tilde{R}] \le \varepsilon l$. But by Lemma 4.2 and Lemma
2.1, $\Pr[D_i|R] = \Pr[D_i|\tilde{R}] \le \varepsilon l$. □

Now consider a run $R_1$ in which only process 1 receives an input
message and no other message is delivered in the run. The next lemma
states that if the probability of process 1 attacking in this run is
exactly $\varepsilon$, then we can prove a tighter lower bound on the
decision probabilities than the bound of Lemma 5.3. Recall that the
bound in Lemma 5.3 was stated in terms of $L_i(R)$.

Lemma A.5 Suppose that $R_1 = \{(v_0, 1, 0)\}$,
$\Pr[D_1|R_1] = \varepsilon$ and $\varepsilon < 0.5$. Then for all
runs $R$ and all $i \in V$, $\Pr[D_i|R] \le ML_i(R)\,\varepsilon$.

Proof: Consider any $i$ and any $R$ such that $ML_i(R) = 0$. Then we
will claim that $\Pr[D_i|R] = 0$. To do this we consider two cases,
one of which must be true if $ML_i(R) = 0$.

• $(v_0, -1)$ does not flow to $(i, N)$ in $R$. Then $L_i(R) = 0$ and
hence by Lemma 5.3, $\Pr[D_i|R] = 0$.

• $(1, 0)$ does not flow to $(i, N)$ in $R$. Thus $i \ne 1$ as
$(1, 0)$ flows to $(1, N)$. Consider the run $Clip_i(R)$. By the
definition of clipping, there is no tuple $(*, 1, *)$ in $Clip_i(R)$,
because if there was, $(1, 0)$ would flow to $(i, N)$ in $R$. Consider
the run $\tilde{R} = Clip_i(R) \cup \{(v_0, 1, 0)\}$. By construction,
the only tuple of the form $(*, 1, *)$ in $\tilde{R}$ is
$(v_0, 1, 0)$. Hence 1 and $i$ are causally independent in
$\tilde{R}$.

Also, $R_1 = Clip_1(\tilde{R})$ and hence, by Lemma 4.2 and Lemma 2.1,
$\Pr[D_1|\tilde{R}] = \Pr[D_1|R_1] = \varepsilon$. Hence by Lemma A.3,
$\Pr[D_i|\tilde{R}] = 0$. But $Clip_i(\tilde{R}) = Clip_i(R)$ and so
by Lemma 4.2 and Lemma 2.1, $\Pr[D_i|R] = \Pr[D_i|\tilde{R}] = 0$.

Thus in either case, we have shown that for any $i$ and $R$,
$\Pr[D_i|R] = 0$ if $ML_i(R) = 0$. The lemma now follows from Lemma
A.4. □

Lemma A.6 Suppose the graph $G$ is connected and has diameter no more
than $N$. Then there is a run $R$ such that $ML_1(R) = ML(R) = 1$, and
the only tuple of the form $(*, 1, *)$ is $(v_0, 1, 0)$.

Proof: Let $T$ be a spanning tree of $G$ with 1 as the root. Such a
tree exists because $G$ is connected. Next we define $R$ as follows.

• $I(R) = \{(v_0, 1, 0)\}$ (i.e., only process 1 receives input).

• For all $i, j \in V$ and $1 \le r \le N$, $(i, j, r) \in R$ iff $i$
is the parent of $j$ in the tree (i.e., information only flows down
the tree).

It is not hard to see that since the height of the tree is no more
than $N$, $ML_1(R) = 1$ and $ML_i(R) \ge 1$ for all $i \in V$. Thus
$ML(R) = 1$. □
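The run built in this proof is easy to generate explicitly. The sketch below uses a breadth-first spanning tree rooted at process 1 (the proof allows any spanning tree; breadth-first is chosen here only so that the height stays within the diameter). The adjacency-dictionary input, the `"v0"` label and the function name are illustrative assumptions.

```python
from collections import deque
from typing import Dict, Hashable, List, Set, Tuple


def tree_run(adj: Dict[Hashable, List[Hashable]], root: Hashable,
             n_rounds: int, source: Hashable = "v0"
             ) -> Set[Tuple[Hashable, Hashable, int]]:
    """Run in which only `root` gets input and messages go parent -> child."""
    # Breadth-first search records each vertex's parent in the spanning tree.
    parent: Dict[Hashable, Hashable] = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    # Input tuple: only the root hears from the source, in round 0.
    run = {(source, root, 0)}
    # In every round 1..N, deliver exactly the parent-to-child messages.
    for child, par in parent.items():
        if par is not None:
            for r in range(1, n_rounds + 1):
                run.add((par, child, r))
    return run
```

Because the root has no parent, the only tuple whose receiver is the root is the input tuple, which is the property the lemma needs.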

We  now  return  to  the  proof  of  Theorem  A.l. 

Proof: Suppose there is some protocol $F$ such that for all $R$,
$\mathcal{L}(F, R) \ge \varepsilon ML(R)$.

By Lemma A.6, there is a run $R_1$ such that
$ML_1(R_1) = ML(R_1) = 1$ and the only tuple of the form $(*, 1, *)$
in $R_1$ is $(v_0, 1, 0)$. It is easy to verify that $L_1(R_1) = 1$.

Thus by assumption,
$\mathcal{L}(F, R_1) \ge \varepsilon ML(R_1) = \varepsilon$. Thus by
Lemma 2.3, $\Pr[D_1|R_1] \ge \varepsilon$. Also, by Lemma 5.3, since
$L_1(R_1) = 1$, $\Pr[D_1|R_1] \le \varepsilon$. Hence,
$\Pr[D_1|R_1] = \varepsilon$.

Now consider the run $R_2 = Clip_1(R_1) = \{(v_0, 1, 0)\}$. Then by
Lemma 4.2 and Lemma 2.1,
$\Pr[D_1|R_2] = \Pr[D_1|R_1] = \varepsilon$. Hence by Lemma A.5, for
all $i, R$, $\Pr[D_i|R] \le \varepsilon ML_i(R)$. Thus for all $R$,
$\min_i \Pr[D_i|R] \le \min_i \varepsilon ML_i(R)$. Thus from Lemma
2.3 and the definition of $ML(R)$,
$\mathcal{L}(F, R) \le \varepsilon ML(R)$.

Thus we have shown that for any protocol $F$, if for all $R$,
$\mathcal{L}(F, R) \ge \varepsilon ML(R)$, then
$\mathcal{L}(F, R) = \varepsilon ML(R)$. This implies the theorem. □

Abstract

Certain types of routing, scheduling and resource allocation problems
in a distributed setting can be modeled as edge coloring problems. We
present fast and simple randomized algorithms for edge coloring a
graph, in the synchronous distributed point-to-point model of
computation. Our algorithms compute an edge-coloring of a graph $G$
with $n$ nodes and maximum degree $\Delta$ with at most
$(1.6 + \varepsilon)\Delta + \log^{1+\delta} n$ colors with high
probability (arbitrarily close to 1), for any fixed
$\varepsilon, \delta > 0$.

To analyze the performance of our algorithms, we introduce new
techniques for proving upper bounds on the tail probabilities of
certain random variables. Chernoff-Hoeffding bounds are fundamental
tools that are used very frequently in estimating tail probabilities.
However, they assume stochastic independence among certain random
variables, which may not always hold. Our results extend the
Chernoff-Hoeffding bounds to certain types of random variables which
are not stochastically independent. We believe that these results are
of independent interest, and merit further study.

1  Introduction 

An important limitation for a distributed network without global
memory in computing any function is that for an efficient algorithm,
each processor can communicate only with those processors that are
within a small radius around it. Models of parallel computation like
the PRAM abstract this problem of locality away by assuming the
existence of a global shared memory with fast concurrent access. We
are interested in studying how fast individual processors can compute
their portion of the output in a message-passing distributed system,
with such “local” information alone. The model we study is the
synchronous distributed point-to-point model, in which the processors
are arranged as the vertices of an $n$-vertex graph $G = (V, E)$, and
where all communication is via the edges of $G$ alone.

We present fast and simple randomized algorithms to edge color $G$
with at most $(1.6 + \varepsilon)\Delta + 0.4 \log^{1+\delta} n$
colors with high probability for any fixed $\varepsilon, \delta > 0$,
where $\Delta$ is the maximum degree of the vertices of $G$. At the
heart of our analysis is an extension of the Chernoff-Hoeffding
bounds, which are key tools in bounding the tail probabilities of
certain random variables.

The edge coloring problem can be used to model certain types of
jobshop scheduling, packet routing, and resource allocation problems
in a distributed setting. For example, given a set of processes
$\mathcal{P}$ and a set of resources $\mathcal{R}$ such that each
process $p \in \mathcal{P}$ needs a subset
$f(p) \subseteq \mathcal{R}$ of the resources where: (i) each process
$p$ needs every resource in $f(p)$ for a unit of time each, and (ii)
$p$ can use the resources in $f(p)$ in any order, we can construct a
bipartite graph
$G_{\mathcal{P},\mathcal{R}} = (\mathcal{P}, \mathcal{R}, E_{\mathcal{P},\mathcal{R}})$
where
$E_{\mathcal{P},\mathcal{R}} = \{(p, r) \mid p \in \mathcal{P}, r \in f(p)\}$,
and an edge coloring of $G_{\mathcal{P},\mathcal{R}}$ with $c$ colors
yields a schedule for the processes to use the resources within $c$
time units.
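The reduction from resource scheduling to edge coloring can be illustrated with a few lines of Python. The helper names and the dictionary representation of $f$ are assumptions made for this sketch; it shows how a proper edge coloring of the process/resource graph is read off as a timetable, not how the coloring is computed.

```python
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]  # (process, resource)


def build_edges(needs: Dict[Hashable, Set[Hashable]]) -> Set[Edge]:
    """Edge set {(p, r) : r in f(p)} of the bipartite graph."""
    return {(p, r) for p, resources in needs.items() for r in resources}


def is_proper_edge_coloring(coloring: Dict[Edge, int]) -> bool:
    """No process and no resource may see the same color twice."""
    used = set()
    for (p, r), c in coloring.items():
        if ("process", p, c) in used or ("resource", r, c) in used:
            return False
        used.add(("process", p, c))
        used.add(("resource", r, c))
    return True


def schedule_from_coloring(coloring: Dict[Edge, int]) -> Dict[int, List[Edge]]:
    """Time step c -> the (process, resource) pairs active in that step."""
    slots: Dict[int, List[Edge]] = {}
    for edge, c in coloring.items():
        slots.setdefault(c, []).append(edge)
    return slots
```

With a proper coloring, each time step uses every process and every resource at most once, so the number of colors is the length of the schedule.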

Edge coloring can also be used in distributed models in situations
where broadcasts are infeasible or undesirable: an edge coloring of
the network results in a schedule for each processor to communicate
with at most one neighbor at every step (at time step $i$, processors
communicate via the edges colored $i$ only), and using a “small”
number of colors reduces the wastage of time in this schedule.

Note that $\Delta$ colors are necessary to edge color a graph with
maximum degree $\Delta$. Vizing showed that it is always possible to
edge color a graph with $\Delta + 1$ colors and gave a polynomial time
algorithm [7]. Karloff & Shmoys gave an RNC algorithm for this problem
that uses $\Delta + \Delta^{0.5+\varepsilon}$ colors, which was
derandomized in NC by Berger & Rompel, and Motwani, Naor & Naor
[6, 14]. In the distributed model, the best known edge coloring
algorithm is to apply a vertex coloring algorithm to the line graph of
$G$. There are fast (polylogarithmic) randomized vertex coloring
algorithms that use $(\Delta + 1)$ and $\Delta$ colors, which
translate to $(2\Delta - 1)$- and $(2\Delta - 2)$-edge coloring
algorithms respectively [13, 15]. There are no known polylogarithmic
deterministic algorithms in the distributed setting for
$(2\Delta - 1)$-edge coloring [3, 15]. Moreover, distributed
$\Delta$-edge coloring requires $\Omega(diameter(G))$ time, even with
randomization [15].

This work is part of a larger research where we are interested in
studying the problem of a distributed model solving some problem
(related to scheduling and synchronization) on its own topology. The
main problems that have been studied in this regard are computing a
maximal independent set (MIS) in $G$ [1, 12], a
$(\Delta + 1)$-vertex coloring of $G$ [3, 13], and a $\Delta$-vertex
coloring of $G$ [15]. An MIS captures the idea of a set of processors
working in parallel without interfering with the decisions made by
their neighbors, and a vertex coloring is a partition of the
processors into independent sets, yielding a schedule for the
processors to work in parallel.

An important point about all of these problems is that they need just
a local search for an incremental update, in the following sense.
Consider the simple sequential algorithm to compute an MIS: pick any
vertex $v$, add it to the partial MIS computed so far, remove $v$ and
its neighbors from $G$, and repeat. Thus, a partial MIS can be updated
incrementally with a local search of radius 1. Similar “local search”
results for incremental updates are known for $(\Delta + 1)$- and
$\Delta$-vertex coloring (and hence for $(2\Delta - 1)$- and
$(2\Delta - 2)$-edge coloring): local searches of radius 1 and radius
$O(\log_\Delta n)$ respectively [15]. Hence, the key problem in
implementing these in a parallel setting is symmetry breaking:
parallelizing the incremental sequential algorithm (implied by such a
local result) by somehow breaking the symmetrical effort of all the
processors to execute one incremental step of the sequential
algorithm. Two key tools for symmetry breaking are randomization
[1, 12, 13] and network decomposition [2, 3, 4, 5, 11, 15].
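The simple sequential MIS algorithm mentioned above fits in a few lines. This is a minimal sketch assuming an adjacency-dictionary graph; picking the smallest remaining vertex is an arbitrary choice made here only so that the output is deterministic, since the text allows any vertex.

```python
from typing import Dict, Hashable, Iterable, Set


def greedy_mis(adj: Dict[Hashable, Iterable[Hashable]]) -> Set[Hashable]:
    """Sequential greedy maximal independent set."""
    remaining = set(adj)
    mis: Set[Hashable] = set()
    while remaining:
        v = min(remaining)               # "pick any vertex v"
        mis.add(v)                       # add it to the partial MIS
        remaining -= {v} | set(adj[v])   # remove v and its neighbors
    return mis
```

Each step inspects only $v$ and its immediate neighbors, which is the radius-1 local search referred to in the paragraph above.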

However, such a local search result is not known for edge coloring
with less than $2\Delta - 2$ colors. Note that if we are allowed
$2\Delta - 1$ colors, an uncolored edge is always surrounded by at
most $2\Delta - 2$ colors, and a local search of radius 1 is
sufficient for the update. It is not clear that any local result
should hold when we have at most $2\Delta - 3$ colors. Hence, one main
contribution of this paper is a fast and simple distributed algorithm
for a problem not falling under the hitherto dominant paradigm of
“symmetry breaking and local search”. Our algorithms are also very
simple; we first present two algorithms for edge coloring bipartite
graphs, and then extend them to general graphs using an idea from
[10]. A sketch of our first algorithm: given a bipartite graph
$G = (A, B, E)$ with maximum degree $\Delta$, each vertex in $B$ picks
distinct colors from $\{1, 2, \ldots, \Delta\}$ at random for its
edges; then, each vertex $v \in A$ checks, for each color $\alpha$, if
more than one of its incident edges has color $\alpha$ and if so,
chooses one of them at random as the winner, and all the other edges
of color $\alpha$ which are incident to $v$ are decolored. The key
claim, which requires an interesting analysis, is that for every
vertex, the number of decolored edges incident to it is at most
$\Delta(1 + \varepsilon)/e$ with high probability. Assuming that this
holds, we can repeat the above iteration with a set of
$\Delta(1 + \varepsilon)/e$ fresh colors, etc., and the above claim
allows us to bound the number of colors used, with high probability.
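One iteration of the first algorithm, as sketched in the paragraph above, can be simulated centrally as follows. This is an illustrative sequential simulation rather than a distributed implementation; the edge-list representation, the function name and the explicit random generator argument are assumptions.

```python
import random
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]  # (a, b) with a in A and b in B


def one_iteration(edges: List[Edge], delta: int, rng: random.Random
                  ) -> Tuple[Dict[Edge, int], List[Edge]]:
    """Return (edges that keep their color, edges that were decolored)."""
    # Step 1: every vertex b in B gives its edges distinct random colors.
    by_b: Dict[Hashable, List[Edge]] = {}
    for a, b in edges:
        by_b.setdefault(b, []).append((a, b))
    tentative: Dict[Edge, int] = {}
    for incident in by_b.values():
        # Requires deg(b) <= delta, which holds for maximum degree delta.
        colors = rng.sample(range(1, delta + 1), len(incident))
        for e, c in zip(incident, colors):
            tentative[e] = c
    # Step 2: every vertex a in A keeps one random winner per color.
    groups: Dict[Tuple[Hashable, int], List[Edge]] = {}
    for (a, b), c in tentative.items():
        groups.setdefault((a, c), []).append((a, b))
    colored: Dict[Edge, int] = {}
    decolored: List[Edge] = []
    for (_, c), group in groups.items():
        winner = rng.choice(group)
        colored[winner] = c
        decolored.extend(e for e in group if e != winner)
    return colored, decolored
```

The surviving colors always form a proper partial edge coloring: colors are distinct at each vertex of $B$ by step 1 and at each vertex of $A$ by step 2. The decolored edges are the ones a further iteration with fresh colors would handle.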

The second main contribution of this paper is the tools developed to
analyze the algorithms. Chernoff-Hoeffding (henceforth CH) bounds
[8, 9] are fundamental tools used in bounding the tail probabilities
of the sum of independent random variables [17]. The most frequent
form in which these bounds are used is when there are $n$ independent
random bits $X_1, X_2, \ldots, X_n$, $X = \sum_{i=1}^{n} X_i$, and
$\mu = E[X]$; in this case, it is possible to show that
$\Pr(X > (1 + \delta)\mu) < F^+(\mu, \delta)$ $\forall \delta > 0$,
where $F^+(\mu, \delta)$ is inverse exponential in $\mu$ and
$\delta^2$, for small enough $\delta$ [16, 17].

We introduce a new way of looking at the CH bounds and prove that CH
type bounds can be used for the sums of certain types of dependent
random variables too. A generalization to some form of dependence is
known [18], but it is not strong enough to be used in our
applications. We also prove similar results for certain types of
dependent non-binary random variables. These new extensions of the CH
bounds are used crucially in the analysis of our algorithms. We
believe that these results have the potential for further
applications, and need further study.

2  A  Generalization  of  Chernoff- 
Hoeffding  Bounds 

Chernoff-Hoeffding (henceforth CH) bounds are important tools used in
estimating the tail probabilities of random variables. Given $n$
random variables $X_1, X_2, \ldots, X_n$, these bounds are used in
deriving an upper bound on the upper tail probability
$\Pr(X > (1 + \varepsilon)\mu)$, where $X = \sum_{i=1}^{n} X_i$,
$\mu = E[X]$, and $\varepsilon > 0$. Chernoff's basic idea for
bounding the upper tail is to use Markov's inequality on the random
variable $e^{tX}$ for an arbitrary, but fixed, $t > 0$, and minimize
with respect to $t$. That is, to use the fact that

$$\Pr(X > (1 + \varepsilon)\mu) = \Pr\left(e^{tX} > e^{t(1+\varepsilon)\mu}\right) \le \frac{E[e^{tX}]}{e^{t(1+\varepsilon)\mu}}$$

and minimize the last ratio for $t > 0$ [8]. Raghavan and Spencer use
this idea to bound the upper tail when $X_1, X_2, \ldots, X_n$ are
independent random bits [16, 17], and show that in this case,

$$\Pr(X > (1 + \varepsilon)\mu) < F^+(\mu, \varepsilon).$$

If $\varepsilon$ is a fixed positive quantity and is at most 1 (which
will be true in all our applications), then
$F^+(\mu, \varepsilon) \le e^{-\mu\varepsilon^2/3}$. If
$\mu = \Omega(\log^{1+\delta} t)$ then $F^+(\mu, \varepsilon)$ is
asymptotically less than any inverse polynomial in $t$, for any fixed
$\delta > 0$. This fact will be used in our algorithms.
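The simplified bound $F^+(\mu, \varepsilon) \le e^{-\mu\varepsilon^2/3}$ for $\varepsilon \le 1$ can be checked numerically. The closed form of $F^+$ is not reproduced in this section, so the sketch below assumes the usual expression from the Raghavan-Spencer analysis, $F^+(\mu, \varepsilon) = \left(e^{\varepsilon}/(1+\varepsilon)^{1+\varepsilon}\right)^{\mu}$; treat that formula and the function names as assumptions.

```python
import math


def f_plus(mu: float, eps: float) -> float:
    """Assumed closed form: (e^eps / (1 + eps)^(1 + eps))^mu."""
    return (math.exp(eps) / (1.0 + eps) ** (1.0 + eps)) ** mu


def simple_bound(mu: float, eps: float) -> float:
    """The simplified bound exp(-mu * eps^2 / 3), stated for 0 < eps <= 1."""
    return math.exp(-mu * eps * eps / 3.0)


def beats_inverse_polynomial(t: float, delta: float, eps: float,
                             degree: float) -> bool:
    """Check exp(-mu * eps^2 / 3) < t^(-degree) when mu = (ln t)^(1 + delta)."""
    mu = math.log(t) ** (1.0 + delta)
    return simple_bound(mu, eps) < t ** (-degree)
```

Under that assumed form, the inequality amounts to $(1+\varepsilon)\ln(1+\varepsilon) \ge \varepsilon + \varepsilon^2/3$ on $(0, 1]$, and the last helper illustrates why $\mu = \Omega(\log^{1+\delta} t)$ eventually beats any fixed inverse polynomial in $t$.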

Hoeffding also has derived a (slightly stronger) bound on the upper
tail of the sum of independent 0-1 random variables [9]. Further, he
has proved bounds on the upper tail of the sum of independent and
bounded non-binary variables, and for the sum of certain types of
bounded and mutually dependent random variables. In this section, we
present upper bounds on the upper tails of some types of dependent
random variables. We present an important special case first (Theorem
1) before showing the general result (Theorem 2).

2.1  Self-weakening binary random variables

Given $n$ 0-1 random variables $X_1, X_2, \ldots, X_n$, we call them
self-weakening with parameter $\lambda$ if
$\Pr(X_{i_1} = X_{i_2} = \cdots = X_{i_k} = 1) \le \lambda \cdot \prod_{j=1}^{k} \Pr(X_{i_j} = 1)$,
for all distinct indices $i_1, i_2, \ldots, i_k$ in the range
$\{1, 2, \ldots, n\}$.

Note that if $\lambda = 1$, a sufficient condition for this to hold is
that $\forall i : 1 \le i \le n$, and
$\forall (j_1, \ldots, j_t) : \{j_1, \ldots, j_t\} \subseteq \{1, 2, \ldots, i-1\}$,
$\Pr(X_i = 1 \mid X_{j_1} = X_{j_2} = \cdots = X_{j_t} = 1) \le \Pr(X_i = 1)$.

Theorem 1 Consider $n$ 0-1 random variables $X_1, X_2, \ldots, X_n$
which are self-weakening with parameter $\lambda$, for some
$\lambda > 0$. Let $X = \sum_{i=1}^{n} X_i$, and let
$E[X] \le \mu^*$, for some $\mu^*$. Then,

$$\Pr(X > (1 + \varepsilon)\mu^*) < \lambda F^+(\mu^*, \varepsilon).$$

Note that any upper bound $\mu^*$ on $E[X]$ will do. This theorem is a
special case of Theorem 2, which will be proved in subsection 2.2. We
now present applications of Theorem 1, which will be used later and
are also of independent interest.

Example 1 $\Delta$ balls are thrown uniformly at random, and
independently of each other, into $\Delta$ bins. Let $X$ be the random
variable denoting the resulting number of empty bins. Then, for any
$\varepsilon > 0$,
$\Pr(X > (1 + \varepsilon)\Delta/e) < F^+(\Delta/e, \varepsilon)$.

Proof. Let $X_i$ be the indicator variable for the
event that bin i was empty after the experiment;
note that $X = \sum_{i=1}^{\Delta} X_i$ and check that
$E[X] \le \Delta/e = \mu^*$. Notice that the $X_i$'s are dependent
and hence we cannot apply the CH bounds directly.
However, the $X_i$'s are self-weakening with parameter 1
since, for any set of indices $\{j_1, j_2, \ldots, j_k\} \subseteq
\{1, 2, \ldots, \Delta\} - \{i\}$, we have that

$$\Pr(X_i = 1 \mid X_{j_1} = X_{j_2} = \cdots = X_{j_k} = 1) = (1 - 1/(\Delta - k))^{\Delta} \le (1 - 1/\Delta)^{\Delta} = \Pr(X_i = 1).$$

Hence, $\Pr(X > (1+\epsilon)\Delta/e) \le F^+(\Delta/e, \epsilon)$ follows
from Theorem 1. □
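The self-weakening claim in this proof can be checked with exact arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it relies on the fact that k fixed bins are all empty with probability $(1 - k/\Delta)^{\Delta}$ when $\Delta$ balls are thrown independently and uniformly into $\Delta$ bins.

```python
from fractions import Fraction


def prob_all_empty(delta: int, k: int) -> Fraction:
    """Exact probability that k fixed bins are all empty when delta balls
    are thrown independently and uniformly into delta bins."""
    return Fraction(delta - k, delta) ** delta


def is_self_weakening(delta: int) -> bool:
    """Check Pr(X_{i_1} = ... = X_{i_k} = 1) <= prod_j Pr(X_{i_j} = 1)
    for every k, i.e. self-weakening with parameter 1."""
    single = prob_all_empty(delta, 1)
    return all(prob_all_empty(delta, k) <= single ** k
               for k in range(1, delta + 1))


def conditional_empty(delta: int, k: int) -> Fraction:
    """Pr(bin i is empty | k other fixed bins are empty), for k < delta.
    This equals (1 - 1/(delta - k)) ** delta, as in the proof."""
    return prob_all_empty(delta, k + 1) / prob_all_empty(delta, k)
```

Conditioning on more empty bins can only lower the chance that a further bin is empty, which is the direction the proof needs.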

Remark. Jain has proved the following lemma [18]: Let
$O_1, O_2, \ldots, O_n$ be n random trials (not necessarily
independent) such that the probability that trial $O_i$ 'succeeds'
is bounded above by p regardless of the outcomes of the
other trials. Then if X is the random variable that
represents the number of 'successes' in these n trials, and
Y is a binomial variable with parameters (n, p), then:
$\Pr[X > k] \le \Pr[Y > k]$, $0 \le k \le n$.

The assumptions of Jain's lemma are strictly stronger
than the self-weakening property with parameter 1;
such strong assumptions do not hold in Example 1. For
instance, $\Pr(X_\Delta = 1 \mid X_1 = X_2 = \cdots = X_{\Delta-1} = 0)
= \frac{\Delta-1}{\Delta+1}$ which, for large enough $\Delta$, is greater than
$\Pr(X_\Delta = 1)$ ($\approx 1/e$).
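The conditional probability in this remark can be checked by brute force for small $\Delta$. The sketch below assumes the balls-in-bins model of Example 1 and uses a hypothetical function name; it enumerates all $\Delta^{\Delta}$ throws, keeps those in which the first $\Delta - 1$ bins are all nonempty, and reports the fraction in which the last bin is empty.

```python
from fractions import Fraction
from itertools import product


def last_empty_given_rest_nonempty(delta: int) -> Fraction:
    """Pr(bin delta is empty | bins 1..delta-1 are all nonempty), computed by
    enumerating every way to throw delta balls into delta bins."""
    conditioning = 0  # throws in which the first delta-1 bins are nonempty
    favourable = 0    # ... and the last bin is empty as well
    for throw in product(range(delta), repeat=delta):
        occupied = set(throw)
        if all(b in occupied for b in range(delta - 1)):
            conditioning += 1
            if (delta - 1) not in occupied:
                favourable += 1
    return Fraction(favourable, conditioning)
```

For the sizes that are feasible to enumerate, the result agrees with $(\Delta-1)/(\Delta+1)$, which tends to 1 and so exceeds the unconditional value $(1 - 1/\Delta)^{\Delta} \approx 1/e$.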


Another application of Theorem 1, which will also be
used later in the analysis of our algorithms, is given in
the following example.

Example 2 Suppose $d \le \Delta$ balls are thrown uniformly
at random and independently into $\Delta$ bins, and in each
bin with at least one ball, one of the balls from that bin
is chosen at random, and the other balls in that bin are
discarded. Denoting by $Z_d$ the number of discarded balls,
$\Pr(Z_d > (1+\epsilon)\Delta/e) \le F^+(\Delta/e, \epsilon)$.

Proof.  Omitted.  □ 
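Since the proof is omitted, a quick sanity check of the mean may help. The sketch below is an illustration rather than the authors' argument: it uses the observation that the number of discarded balls equals d minus the number of nonempty bins, and the function name is hypothetical.

```python
from fractions import Fraction
import math


def expected_discarded(d: int, delta: int) -> Fraction:
    """Exact E[Z_d] when d balls are thrown into delta bins and every
    nonempty bin keeps exactly one ball.

    Each bin is nonempty with probability 1 - (1 - 1/delta)**d, so
    E[Z_d] = d - delta * (1 - (1 - 1/delta)**d)."""
    expected_nonempty = delta * (1 - Fraction(delta - 1, delta) ** d)
    return d - expected_nonempty
```

For example, `expected_discarded(2, 2)` is 1/2: two balls collide in one of two bins with probability 1/2, and exactly one of them is then discarded. For every $d \le \Delta$ tried, the mean stays below $\Delta/e$, the value used as $\mu^*$.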

2.2 A new look at Chernoff-Hoeffding type bounds

Hoeffding [9] has shown that if n random variables
$X_1, X_2, \ldots, X_n$ are independent with $a_i \le X_i \le b_i$
($1 \le i \le n$), and if $X = \sum_{i=1}^{n} X_i$ with $E[X] = \mu$,
then for any $\epsilon > 0$,

$$\Pr(X > (1+\epsilon)\mu) \le G^+(\mu, \epsilon, a_1, b_1, a_2, b_2, \ldots, a_n, b_n),$$

where $G^+$ denotes the explicit bound given in [9].

Note that if, for all $i$, $b_i - a_i$ is bounded by some constant
and if $\mu = \Theta(n)$,
then $G^+(\mu, \epsilon, a_1, b_1, a_2, b_2, \ldots, a_n, b_n)$
is asymptotically smaller than any inverse polynomial in
n, for any fixed $\epsilon > 0$. We now present CH-bounds that
hold for certain types of non-binary random variables,
which satisfy a condition of which "self-weakening" is
a special case.

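Theorem 2 Let $X = \sum_{i=1}^{n} X_i$, where each $X_i$ is a random
variable such that $X_i \in [a_i, b_i]$. If $E[X] \le \mu^*$ and
there exists $\lambda > 0$ such that, for all nonempty sets of
indices $I \subseteq \{1, \ldots, n\}$ and strictly positive integer
values $s_i$, $E[\prod_{i \in I} X_i^{s_i}] \le \lambda \cdot \prod_{i \in I} E[X_i^{s_i}]$, then

$$\Pr(X > (1+\epsilon)\mu^*) \le \lambda \cdot G^+(\mu^*, \epsilon, a_1, b_1, a_2, b_2, \ldots, a_n, b_n).$$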

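Proof. (Sketch). Since X is bounded by assumption,
it follows for any t (in particular, for any t > 0),
that linearity of expectation holds for the infinite sum
$E[e^{tX}] = \sum_{k \ge 0} \frac{t^k}{k!} E[X^k]$.
$E[X^k]$ contains terms of the form $E[\prod_{i=1}^{n} X_i^{s_i}]$ which,
by assumption, is at most $\lambda \cdot \prod_{i=1}^{n} E[X_i^{s_i}]$. Note that
if $Y_1, \ldots, Y_n$ are independent random variables with $Y_i$
having the same distribution as $X_i$, then $E[e^{tY}] =
\prod_{i=1}^{n} E[e^{tY_i}]$, where $Y = \sum_{i=1}^{n} Y_i$, and that for
all non-negative integers $s_1, s_2, \ldots, s_n$, $E[\prod_{i=1}^{n} Y_i^{s_i}] =
\prod_{i=1}^{n} E[Y_i^{s_i}] = \prod_{i=1}^{n} E[X_i^{s_i}]$. Hence, $E[e^{tX}] \le \lambda \cdot
\prod_{i=1}^{n} E[e^{tX_i}]$ and so for any t > 0,

$$\Pr(X > (1+\epsilon)\mu^*) \le \frac{E[e^{tX}]}{e^{t(1+\epsilon)\mu^*}} \le \frac{\lambda \cdot \prod_{i=1}^{n} E[e^{tX_i}]}{e^{t(1+\epsilon)\mu^*}},$$

which can be bounded by $\lambda \cdot G^+(\mu^*, \epsilon, a_1, b_1, a_2, b_2, \ldots, a_n, b_n)$, by
picking a suitable t > 0 [9]. □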

A  special  case  of  Theorem  2  is  given  in  the  following 
corollary  and  should  be  of  independent  interest. 


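Corollary 1 Let $X = \sum_{i=1}^{n} X_i$, where each $X_i$ is
a non-negative, discrete valued random variable with
values in $[a_i, b_i]$. If $E[X] \le \mu^*$ and there exists a
$\lambda > 0$ such that, for all nonempty sets of indices
$I \subseteq \{1, \ldots, n\}$ and $c_i > 0$, $\Pr[\wedge_{i \in I} X_i = c_i] \le
\lambda \cdot \prod_{i \in I} \Pr[X_i = c_i]$, then $\Pr(X > (1+\epsilon)\mu^*) \le
\lambda \cdot G^+(\mu^*, \epsilon, a_1, b_1, a_2, b_2, \ldots, a_n, b_n)$.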

Proof.  It  is  easily  checked  that  the  assumptions  of 
this  corollary  are  a  special  case  of  the  assumptions  of 
Theorem  2.  □ 

Theorem  1  is  actually  a  corollary  of  Theorem  2;  the 
assumptions  of  Theorem  1  again  are  a  special  case  of 
those  of  Theorem  2,  and  Theorem  1  follows  from  the 
proof  of  Theorem  2,  and  from  the  existence  of  a  suitable 
t > 0 to make the upper tail bound at most $\lambda \cdot F^+(\mu^*, \epsilon)$
[16,  17]. 

The basic idea of Theorem 2 is that if we can upper-bound
each term of the series expansion of $E[e^{tX}]$ with
an equivalent term from the series expansion of $E[e^{tU}]$,
where U is the sum of independent random variables
$U_i$'s whose range is the same as that of the $X_i$'s, the
CH-bounds of U apply also to the random variable X.
With the same reasoning of Theorem 2 we can prove
the following corollaries.

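Corollary 2 Let $X = \sum_{i=1}^{n} X_i$, where each $X_i$ is a
random variable such that $X_i \in [a_i, b_i]$, and let $U =
\sum_{i=1}^{n} U_i$, where the $U_i$'s are independent random variables
such that $U_i \in [a_i, b_i]$. If $E[X], E[U] \le \mu^*$ and
there exists $\lambda > 0$ such that, for all nonempty sets of
indices $I \subseteq \{1, \ldots, n\}$ and strictly positive integer
values $s_i$, $E[\prod_{i \in I} X_i^{s_i}] \le \lambda \cdot \prod_{i \in I} E[U_i^{s_i}]$, then $\Pr(X >
(1+\epsilon)\mu^*) \le \lambda \cdot G^+(\mu^*, \epsilon, a_1, b_1, a_2, b_2, \ldots, a_n, b_n)$.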

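Corollary 3 Let $X = \sum_{i=1}^{n} X_i$, where each $X_i$ is a
random variable such that $X_i \in [0, 1]$, and let $U =
\sum_{i=1}^{n} U_i$, with the $U_i$'s being independent binary random
variables. If $E[X], E[U] \le \mu^*$ and there exists
$\lambda > 0$ such that, for all nonempty sets of indices
$I \subseteq \{1, \ldots, n\}$ and strictly positive integer values $s_i$,
$E[\prod_{i \in I} X_i^{s_i}] \le \lambda \cdot \prod_{i \in I} E[U_i^{s_i}]$,
then $\Pr(X > (1+\epsilon)\mu^*) \le \lambda \cdot F^+(\mu^*, \epsilon)$.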

Both  corollaries  will  be  used  in  the  analysis  of  our 
algorithms.  The  next  example  contains  an  application 
of  corollary  2.  We  first  give  a  simple  lemma  without 
proof. 


Lemma 1 If 0 < p < 1, q = 1 - p and m is a positive
integer, then
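Lemma 1 is invoked later to evaluate binomial sums weighted by $r/(r+1)$. One identity of that kind, stated here as an assumption about the lemma's form rather than a quotation of it, is

$$\sum_{r=0}^{m} \binom{m}{r} p^r q^{m-r} \frac{r}{r+1} = 1 - \frac{1 - q^{m+1}}{(m+1)p},$$

which reduces to $q^{m+1}$ when $p = 1/(m+1)$. The sketch below, with hypothetical function names, checks the identity in exact arithmetic.

```python
from fractions import Fraction
from math import comb


def weighted_binomial_sum(m: int, p: Fraction) -> Fraction:
    """Left-hand side: E[R / (R + 1)] for R ~ Binomial(m, p)."""
    q = 1 - p
    return sum(Fraction(r, r + 1) * comb(m, r) * p ** r * q ** (m - r)
               for r in range(m + 1))


def closed_form(m: int, p: Fraction) -> Fraction:
    """Right-hand side of the assumed identity."""
    q = 1 - p
    return 1 - (1 - q ** (m + 1)) / ((m + 1) * p)
```

With $m = \Delta - 1$ and $p = 1/\Delta$ the sum equals $(1 - 1/\Delta)^{\Delta}$, which is at most $1/e$.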


Example 3 Suppose d white balls have been thrown at
random into $\Delta$ bins, $1 \le d \le \Delta - 1$. After this, a red
ball is thrown at random into one of the bins, one ball
is chosen at random from the bin in which the red ball
fell, and the other balls in that bin are discarded. Let
Z be the random variable denoting the probability that
the red ball is discarded as a function of the positions of
the white balls (Z is itself a random variable depending on
the positions of the white balls). Then, $E[Z] \le 1/e$ and
$\Pr(Z > (1+\epsilon)/e) \le F^+(\Delta/e, \epsilon)$.

Proof. Let $Z_i$ be the random variable denoting
the number of white balls in bin i. Then, $Z =
\frac{1}{\Delta} \sum_{i=1}^{\Delta} Z_i/(Z_i + 1)$. Let $Y_i = Z_i/(Z_i + 1)$ and
$Y = \sum_{i=1}^{\Delta} Y_i$. Note that the $Y_i$'s are discrete random variables
with values in [0, 1] and that $Y = \Delta \cdot Z$. We will
show that $E[Y] \le \Delta/e$ and that $\Pr(Y > (1+\epsilon)\Delta/e) \le
F^+(\Delta/e, \epsilon)$, which will give our claim.

Firstly, we may assume that $d = \Delta - 1$: $\Pr(Y >
(1+\epsilon)\Delta/e)$ is maximized at $d = \Delta - 1$, as d varies from
1 to $\Delta - 1$ (proof omitted). First, we will show that, for
all i, $E[Y_i] \le 1/e$ and then we will show that, for any
set of k indices $I \subseteq \{1, 2, \ldots, \Delta\}$ and strictly positive
integers $s_i$,

$$E\Big[\prod_{i \in I} Y_i^{s_i}\Big] \le e^{-k}. \qquad (1)$$

Given this we can apply Corollary 2 by introducing $n = \Delta$
independent 0-1 random variables $U_i$ such that $E[U_i] =
\Pr(U_i = 1) = 1/e$. Since the $U_i$'s are binary, equation 1
is the same as $E[\prod_{i \in I} Y_i^{s_i}] \le \prod_{i \in I} E[U_i^{s_i}]$.
Noting that $0 \le Y_i \le 1$, it suffices to show that

$$E\Big[\prod_{i \in I} Y_i\Big] \le e^{-k}. \qquad (2)$$

Without loss of generality we can assume $I =
\{1, \ldots, k\}$. We will prove inequality 2 by induction on
$k \ge 1$; when k = 1,

$$E[Y_1] = \sum_{r=1}^{\Delta-1} \frac{r}{r+1} \binom{\Delta-1}{r} (1/\Delta)^r (1 - 1/\Delta)^{\Delta-1-r} = (1 - 1/\Delta)^{\Delta} \le 1/e,$$

where the second equality follows from Lemma 1. Notice
that for all j, $1 \le j \le \Delta$, $E[Y_j] = E[Y_1] \le 1/e$.
When k > 1, the law of conditional probabilities gives

$$E\Big[\prod_{i=1}^{k} Y_i\Big] = E\big[Y_1 \cdots Y_{k-1} E[Y_k \mid Y_1, \ldots, Y_{k-1}]\big]. \qquad (3)$$

If we show that, for all non-zero $c_i \in [0, 1]$ with $i \in
\{1, \ldots, k-1\}$,

$$E\Big[Y_k \,\Big|\, \wedge_{i=1}^{k-1} Y_i = c_i\Big] \le \frac{1}{e}, \qquad (4)$$

then, since if some $c_i$ is zero then the product
$Y_1 Y_2 \cdots Y_{k-1}$ is also zero, we see by induction on k that

$$E\Big[\prod_{i=1}^{k} Y_i\Big] = E\big[Y_1 \cdots Y_{k-1} E[Y_k \mid Y_1 \cdots Y_{k-1}]\big] \le \frac{1}{e} E\Big[\prod_{i=1}^{k-1} Y_i\Big] \le e^{-k}.$$

Hence, Example 3 follows if we can show that inequality
4 holds.

If $a_i$ denotes the number of white balls that fell into
bin i, then $c_i = a_i/(a_i + 1)$. Let $a = \sum_{i=1}^{k-1} a_i \ge k - 1$,
$p = 1/(\Delta - k + 1)$, and $q = 1 - p$. Then

$$E\Big[Y_k \,\Big|\, \wedge_{i=1}^{k-1} Y_i = c_i\Big] = E\Big[Y_k \,\Big|\, \wedge_{i=1}^{k-1} Z_i = a_i\Big] = \sum_{r=1}^{\Delta-1-a} f(r, a),$$

where

$$f(r, a) = \frac{r}{r+1} \binom{\Delta-1-a}{r} p^r q^{\Delta-1-a-r}.$$

It is easy to check that $f(r, a) \ge f(r, a+1)$. As a
consequence, the maximum value of $E[Y_k \mid \wedge_{i=1}^{k-1} Y_i = c_i]$
is attained at $a = k - 1$, in which case we have

$$\sum_{r=1}^{\Delta-1-a} f(r, a) \le \sum_{r=1}^{\Delta-k} f(r, k-1) = q^{\Delta-k+1} \le 1/e,$$

by Lemma 1.

□

3 Edge Coloring Algorithms for Bipartite Graphs

We present two bipartite edge coloring algorithms now.
In the next section, we will see how they can be used
as subroutines to compute edge colorings of arbitrary
graphs within the same bounds. We first present a simple
distributed Monte Carlo algorithm for edge coloring
bipartite graphs, and analyze its performance with the
techniques developed in the preceding section. The
algorithm takes O(log n) time and colors a bipartite graph
G of degree $\Delta$ with approximately $1.6\Delta + 0.4 \log^{2+\delta} n$
colors with high probability, for any $\delta > 0$ (the failure
probability will depend on $\delta$). We will say that
a statement holds with high probability (w.h.p.) if the
probability that it holds is at least $1 - 1/f(n)$, where
$f(n)$ is asymptotically greater than any polynomial of
n.

Given a bipartite graph G = (A, B, E), we denote by
$\delta(u)$ the set of edges incident on vertex u. Each vertex
knows whether it belongs to A or B. This is an important
assumption because it cannot be computed fast
distributively [15], but it will be removed when we discuss
edge coloring general graphs. We initialize a variable
$\Delta_{curr}$ to $\Delta$; during any iteration of the algorithm,
$\Delta_{curr}$ is meant to be an upper bound on the degree of
the current graph; we will prove later that this holds
w.h.p. The algorithm is given two parameters $\epsilon, \delta > 0$,
and is as follows:

1. Part I: Repeat until $\Delta_{curr} < \log^{2+\delta} n$:

Let G' be the current graph. Pick a set $\chi$ of $\Delta_{curr}$
fresh new colors.

• (Assigning temporary colors) In parallel and
independently of the other vertices in B, each
vertex $v \in B$ assigns a temporary color to each
edge in $\delta(v)$ with uniform probability without
replacement, i.e. edge $e_1$ is assigned color $\alpha \in
\chi$ with probability $1/\Delta_{curr}$, $e_2$ is assigned $\beta \in
\chi - \{\alpha\}$ with probability $1/(\Delta_{curr} - 1)$ and so
on.

• (Choosing winners) The coloring so far is consistent
around any vertex $v \in B$ but can be
inconsistent around a vertex $u \in A$. For each
$u \in A$, let $C_u(\alpha)$ be the set of edges in $\delta(u)$
with temporary color $\alpha$. Each vertex $u \in A$ selects
a winner uniformly at random in $C_u(\alpha)$,
for each nonempty $C_u(\alpha)$. All other edges are
decolored and assigned color $\perp$.

• Set $\Delta_{curr} := \Delta_{curr}(1+\epsilon)/e$. $G_\perp$, the subgraph
of G' induced by the $\perp$-edges, becomes
the new current graph.

2. Part II: Let $G_r$ be the remaining graph. Edge
color $G_r$ with $2\Delta(G_r) - 1$ colors by executing
Luby's vertex coloring algorithm on the line graph
of $G_r$ for O(log n) time [13].
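One iteration of Part I can be simulated centrally as in the sketch below. This is an illustration under stated assumptions, not the distributed implementation: the graph is simple, edges are given as pairs (u, v) with u in A and v in B, every vertex in B has degree at most `delta_curr`, colors are the integers 0..delta_curr-1, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def part_one_iteration(edges, delta_curr, rng):
    """Simulate one iteration of Part I on a bipartite edge list.

    Returns (colored, uncolored): a dict mapping each winning edge to its
    color, and the list of edges that received the 'bottom' color."""
    # Step 1: each v in B gives distinct temporary colors to its edges
    # (uniform sampling without replacement from the fresh palette).
    edges_at_b = defaultdict(list)
    for u, v in edges:
        edges_at_b[v].append((u, v))
    temporary = {}
    for incident in edges_at_b.values():
        palette = rng.sample(range(delta_curr), len(incident))
        for edge, color in zip(incident, palette):
            temporary[edge] = color

    # Step 2: each u in A keeps one uniformly random winner per color class.
    color_classes = defaultdict(list)
    for (u, v), color in temporary.items():
        color_classes[(u, color)].append((u, v))
    colored = {}
    for (_, color), candidates in color_classes.items():
        colored[rng.choice(candidates)] = color

    # Step 3: all losing edges form the next current graph.
    uncolored = [edge for edge in edges if edge not in colored]
    return colored, uncolored
```

The partial coloring is proper by construction: colors around a vertex of B are distinct because they are sampled without replacement, and colors around a vertex of A are distinct because each color class keeps a single winner.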

To bound the number of colors used, we will prove
that the maximum degree of the graph shrinks by a
factor of at least $(1+\epsilon)/e$ w.h.p. in every iteration of
part (I) above, i.e., that in every iteration, $\Delta_{curr}$ is an
upper bound for the degree of the graph at the start
of that iteration. This would imply that the maximum
degree of $G_r$ is at most $\log^{2+\delta} n$ w.h.p. Hence, w.h.p.,
the number of colors used is at most

$$C = \Delta + \frac{\Delta}{e}(1+\epsilon) + \cdots + \frac{\Delta}{e^{k-1}}(1+\epsilon)^{k-1} + 2\log^{2+\delta} n,$$

where k is the smallest integer such that $\Delta(1+\epsilon)^k/e^k <
\log^{2+\delta} n$. The total number of colors C is upper
bounded by

$$C \le \Big(\frac{e}{e-1} + \epsilon'\Big)\Delta + \Big(2 - \frac{e}{e-1} - \epsilon'\Big)\log^{2+\delta} n \approx 1.6\Delta + 0.4\log^{2+\delta} n.$$

Here, $\epsilon'$ depends on $\epsilon$ and can be made arbitrarily
small. The running time of the algorithm is O(log n):
Part I takes O(log $\Delta$) time and Part II, i.e., Luby's
algorithm, takes O(log n) time.
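The color count can be tabulated with a short calculation. The sketch below makes several assumptions that are not in the text: logarithms are taken base 2, ceilings are ignored, and Part II is charged twice the final value of $\Delta_{curr}$ (which is below the threshold) instead of $2\log^{2+\delta} n$. It uses $e/(e - 1 - \epsilon)$ for $e/(e-1) + \epsilon'$, and the function names are hypothetical.

```python
import math


def colors_used(delta: float, n: int, eps: float, dlt: float):
    """Idealized color count: Part I spends delta_curr colors per iteration,
    Part II is charged 2 * delta_curr for the final delta_curr."""
    threshold = math.log2(n) ** (2 + dlt)
    ratio = (1 + eps) / math.e          # shrink factor per iteration
    current, total, iterations = float(delta), 0.0, 0
    while current >= threshold:
        total += current                # fresh palette of delta_curr colors
        current *= ratio
        iterations += 1
    total += 2 * current                # Part II on the remaining graph
    return total, iterations


def stated_bound(delta: float, n: int, eps: float, dlt: float) -> float:
    """(e/(e-1) + eps') * delta + (2 - e/(e-1) - eps') * log^{2+dlt} n,
    with e/(e-1) + eps' taken to be e / (e - 1 - eps)."""
    threshold = math.log2(n) ** (2 + dlt)
    coefficient = math.e / (math.e - 1 - eps)
    return coefficient * delta + (2 - coefficient) * threshold
```

Under these assumptions the count is a geometric sum, so it stays below the stated bound whenever $\epsilon$ is small enough that $e/(e - 1 - \epsilon) < 2$.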

Consider any iteration of part (I) above, with an upper
bound $\Delta_{curr}$ on the degree of the graph at the beginning
of that iteration. For each edge e, we introduce
an indicator random variable $H_e$ that is 1 when e gets
color $\perp$ in that iteration, and 0 otherwise (if e has a
vertex v as an endpoint, we also denote $H_e$ by $H_v(e)$).
When $H_e = 1$ we say that edge e is hit. The variable
$H_v = \sum_{e \in \delta(v)} H_e$ equals the number of edges incident
to v that are hit, and represents the degree of v in the
next iteration. It is easy to show that, for each vertex
v, $E[H_v] \le \Delta_{curr}/e$. But, we also need to estimate
the tail probability of the distribution of $H_v$. With the
techniques developed in the previous section, we will be
able to prove CH-bounds for the upper tail of $H_v$ and
this will be sufficient for our purposes, since if $\mu$ denotes
$\Delta_{curr}/e$, then

$$\Pr(\Delta_1 > (1+\epsilon)\mu) = \Pr(\exists v : H_v > (1+\epsilon)\mu) \le \sum_{v \in A \cup B} \Pr(H_v > (1+\epsilon)\mu) \le n \cdot c \cdot \Delta_{curr} \cdot F^+(\sqrt{\Delta_{curr}}/e, \epsilon/2),$$

where $\Delta_1$ is the new degree of the graph after that
iteration, $c \cdot \Delta_{curr} \cdot F^+(\sqrt{\Delta_{curr}}/e, \epsilon/2)$ is the upper bound
that we will prove on the upper tail of the $H_v$'s, and c
is a constant. Henceforth, $\sqrt{\Delta_{curr}}$ will be denoted by
$\Delta_0$. The following lemma is easy:

Lemma 2 In any iteration and for any edge (u, v),
$\Pr((u, v) \text{ is hit}) \le e^{-1}$.

Hence, given that $\Delta_{curr}$ was an upper bound on the degree
of the graph at the start of that iteration, $E[H_v] \le
\Delta_{curr}/e$, for all vertices v. We now want to prove

Theorem 3 In any one iteration, $\Pr(H_v > (1+\epsilon)\Delta_{curr}/e)
\le c \cdot \Delta_{curr} \cdot F^+(\Delta_0/e, \epsilon/2)$ for all vertices
v, given that $\Delta_{curr}$ was an upper bound on the degree
of the graph at the start of that iteration.

The proof of this theorem will take several lemmas.
Note that $F^+(\Delta_0/e, \epsilon/2) \le \exp(-\epsilon^2 \Delta_0/12e)$ if $\epsilon < 2$;
this is asymptotically less than any inverse polynomial
in n when $\Delta_{curr} = \Omega(\log^{2+\delta} n)$ for any fixed $\delta > 0$,
if $\epsilon$ is fixed. Hence, if Theorem 3 is true, then $\Delta_{curr}$
is an upper bound on the maximum degree in every
iteration w.h.p. When $\Delta_{curr}$ is smaller than $\log^{2+\delta} n$,
the probability bound is not good any more, and so we
switch to Part (II) of the algorithm then.

Henceforth, we focus on any one iteration, assuming
that $\Delta_{curr}$ was an upper bound on the maximum degree
at the start of that iteration. It turns out that proving
Theorem 3 is much simpler for the A vertices than for
the B vertices. We start by analyzing the vertices in
A. Given some vertex $u \in A$, we want to upper bound
the upper tail of $H_u$. We show now that for the edges
$e \in \delta(u)$, the random variables $H_e$, though dependent,
are self-weakening with parameter 1. Note that, as far
as $u \in A$ is concerned, each of the edges in $\delta(u)$ chooses
a color independently at random with probability $\Delta_{curr}^{-1}$;
hence the situation is analogous to Example 2. Hence,
by Example 2,

Lemma 3 For any vertex $v \in A$, $\Pr(H_v > (1+\epsilon)\Delta_{curr}/e)
\le F^+(\Delta_{curr}/e, \epsilon)$.

The derivation of such a result for vertices in B is
much harder. The problem can be seen in Figure 1.
Suppose we are given that $e_1$ got temporary color $\alpha$
and was hit, and that $e_2$ got temporary color $\beta$; we will
argue intuitively that given this, the probability of $e_2$
being hit has increased. Since $e_1$ was hit, the probability
of $e_3$ getting temporary color $\alpha$ increases, which implies
that the probability of $e_4$ getting temporary color $\beta$ also
increases, and this increases the probability of $e_2$ being
hit. So, it is not obvious that, for $v \in B$, the $H_v(e)$'s
are self-weakening; in fact, the above argument seems to
imply the antithesis of that. With a series of lemmas, we
will prove that they are self-weakening (with parameter
to be determined later), and Theorem 3 will follow. We
will again focus on any particular iteration, at the start
of which $\Delta_{curr}$ was an upper bound on the maximum
degree of the graph. We consider some vertex $v \in B$,
and assume that its degree is $\Delta_{curr}$ (easily extended to
the case $degree(v) < \Delta_{curr}$).

Let v's neighbors in A be $u_1, u_2, \ldots, u_{\Delta_{curr}}$, and denote
the edge $(v, u_i)$ by $e_i$. For technical reasons that
will be clear later, we subdivide the edges $\delta(v)$ into $\Delta_0$
groups (recall that $\Delta_0$ denotes $\sqrt{\Delta_{curr}}$), such that each
group has $\Delta_0$ edges. Let $H^i$ be the random variable
counting the number of edges of $\delta(v)$ in group i that are
hit. Then,

$$\Pr(H_v > (1+\epsilon)\Delta_{curr}/e) \le \Pr(\exists i,\ 1 \le i \le \Delta_0 : H^i > (1+\epsilon)\Delta_0/e) \le \sum_{i=1}^{\Delta_0} \Pr(H^i > (1+\epsilon)\Delta_0/e).$$

We will prove that the generalized CH bounds hold for
any of the $H^i$'s and get

$$\Pr(H_v > (1+\epsilon)\Delta_{curr}/e) \le \Delta_0 \Pr(H > (1+\epsilon)\Delta_0/e) \le c \cdot \Delta_{curr} \cdot F^+(\Delta_0/e, \epsilon/2)$$

where, to simplify our notation, H stands for the sum
of any choice of at most $\Delta_0$ random variables chosen
among the $H_v(e_i)$'s, say, $H_v(e_1), H_v(e_2), \ldots, H_v(e_{\Delta_0})$.

To estimate the tail probability of H, we assume that
v was the last among the vertices in B to choose its
permutation, i.e., that the edges $e_1 = (v, u_1), \ldots, e_{\Delta_{curr}} =
(v, u_{\Delta_{curr}})$ got their temporary colors after all other
edges incident to the $u_i$'s had got temporary colors. Define,
for each $u_i$, a column vector $C_i$ with $\Delta_{curr}$ entries,
where $C_i(j) = a/(a+1)$ iff $u_i$ has exactly a edges with
temporary color j before the random choice of edge $e_i$.

To prove that the $H_v(e_i)$'s are self-weakening, we analyze
the generic term $\Pr(H_v(e_{i_1}) = \cdots = H_v(e_{i_k}) = 1)$,
where $k \le \Delta_0$ and each $i_j$ is in the range $1, 2, \ldots, \Delta_0$.
Construct a $\Delta_{curr} \times k$ matrix M by concatenating
$C_{i_1}, \ldots, C_{i_k}$. M contains the information for computing
$\Pr(H_1 = H_2 = \cdots = H_k = 1)$ (where $H_j$ denotes $H_v(e_{i_j})$,
$1 \le j \le k$), and this will be our next step. Though M
is not square, we define its permanent perm(M) in the
obvious way: $perm(M) = \sum \prod_{i=1}^{k} M(\alpha_i, i)$, where the
sum is taken over all the $\alpha_i$'s in the range $\{1, 2, \ldots, \Delta_{curr}\}$
such that $\alpha_i \ne \alpha_j$ for $i \ne j$. In what follows, let
$p(n, k) = n(n-1) \cdots (n-k+1)$.

Lemma 4 $\Pr(H_1 = H_2 = \cdots = H_k = 1) =
perm(M)/p(\Delta_{curr}, k)$.
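The rectangular permanent defined above, together with the column-sum bound used in the argument that follows, can be checked on small matrices. The sketch below is a brute-force illustration with a hypothetical function name; it takes a matrix given as a list of rows with nonnegative entries.

```python
from fractions import Fraction
from itertools import permutations
from math import prod


def rect_permanent(matrix):
    """Permanent of a rows x cols matrix with cols <= rows: the sum, over all
    ways to pick pairwise distinct rows alpha_1..alpha_cols, of the product
    of the entries matrix[alpha_i][i]."""
    rows, cols = len(matrix), len(matrix[0])
    return sum(prod(matrix[alpha[i]][i] for i in range(cols))
               for alpha in permutations(range(rows), cols))


def column_sum_product(matrix):
    """Product of the column sums; an upper bound on rect_permanent for
    nonnegative entries, since it also counts repeated rows."""
    cols = len(matrix[0])
    return prod(sum(row[i] for row in matrix) for i in range(cols))
```

For a 4 x 2 matrix with columns (1/2, 0, 1/2, 0) and (0, 2/3, 1/2, 0), the permanent is 11/12 while the product of the column sums is 7/6; the gap of 1/4 is the single term that uses the third row twice.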

Let $S(C_i)$ be the sum of the entries of $C_i$, $E_1$ be the
event "for all i, $1 \le i \le \Delta_0$, $S(C_i)$ is at most $\Delta_{curr}(1 +
\epsilon')/e$" (for some $\epsilon' > 0$ to be specified), and $E_2$ be the
event "$H > (1+\epsilon)\Delta_0/e$". In Example 3 we showed that,
for any $C_i$, $\Pr(S(C_i) > (1+\epsilon')\Delta_{curr}/e) \le F^+(\Delta_{curr}/e, \epsilon')$
($S(C_i)$ corresponds to the variable Y of Example 3); it
follows that $\Pr(\bar{E}_1) \le \Delta_0 \cdot F^+(\Delta_{curr}/e, \epsilon')$.

It is not difficult to prove that perm(M) is at
most the product of the column sums in M, hence
Lemma 4 implies that given $E_1$, $perm(M)/p(\Delta_{curr}, k) \le
\Delta_{curr}^k (1+\epsilon')^k/(e^k p(\Delta_{curr}, k))$. Next, a simple lemma:

Lemma 5 For positive integers t and k, $t^k/p(t, k) \le
e^{k^2/t}$, if $k \le t/2$.
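Lemma 5 is easy to confirm numerically for small parameters. The sketch below reads the bound as $t^k/p(t,k) \le e^{k^2/t}$ for $k \le t/2$, which is an assumption about the exponent, and uses hypothetical function names.

```python
import math


def power_over_falling_factorial(t: int, k: int) -> float:
    """t**k / p(t, k), where p(t, k) = t (t-1) ... (t-k+1)."""
    return t ** k / math.perm(t, k)


def lemma5_holds(t: int, k: int) -> bool:
    """Check t**k / p(t, k) <= exp(k*k / t)."""
    return power_over_falling_factorial(t, k) <= math.exp(k * k / t)
```

The inequality follows from $1/(1-x) \le e^{2x}$ for $0 \le x \le 1/2$, applied to each factor $1/(1 - i/t)$ with $i < k$. With $t = \Delta_{curr}$ and $k \le \sqrt{\Delta_{curr}}$ the exponent is at most 1, which is where the factor e in the next inequality comes from.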

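By applying Lemma 5 with $t = \Delta_{curr}$ and $k \le \Delta_0 =
\sqrt{\Delta_{curr}}$, we have that given $E_1$,

$$\frac{perm(M)}{p(\Delta_{curr}, k)} \le e \cdot \frac{(1+\epsilon')^k}{e^k} \le e \cdot e^{\epsilon' \Delta_0} \cdot e^{-k}. \qquad (5)$$

Inequality 5 tells us that $\Pr(H_v(e_{j_1}) = H_v(e_{j_2}) =
\cdots = H_v(e_{j_k}) = 1 \mid E_1) \le e \cdot e^{\epsilon' \Delta_0} \cdot \Pr(X_{j_1} = X_{j_2} =
\cdots = X_{j_k} = 1)$, where $\{j_1, j_2, \ldots, j_k\} \subseteq \{1, 2, \ldots, \Delta_0\}$,
and $X_1, X_2, \ldots, X_{\Delta_0}$ are independent random bits with
$\Pr(X_i = 1) = 1/e$, $1 \le i \le \Delta_0$. Hence, by Corollary 3
we see that $\Pr(E_2 \mid E_1) \le e \cdot e^{\epsilon' \Delta_0} \cdot F^+(\Delta_0/e, \epsilon)$. Thus,

$$\Pr(E_2) = \Pr(E_2 \mid E_1)\Pr(E_1) + \Pr(E_2 \mid \bar{E}_1)\Pr(\bar{E}_1)$$
$$\le \Pr(E_2 \mid E_1) + \Pr(\bar{E}_1)$$
$$\le e \cdot e^{\epsilon' \Delta_0} \cdot F^+(\Delta_0/e, \epsilon) + \Delta_0 \cdot F^+(\Delta_{curr}/e, \epsilon')$$
$$\le c \cdot \Delta_0 F^+(\Delta_0/e, \epsilon/2),$$

for some constant c and for large enough $\Delta_0$, by choosing
$\epsilon' = \epsilon^2/(6e)$.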

Lemma 6 For all $v \in B$ and any iteration at the start
of which $\Delta_{curr}$ was an upper bound on the maximum
degree of the graph, $\Pr(H_v > (1+\epsilon)\Delta_{curr}/e) \le c \cdot
\Delta_{curr} \cdot F^+(\sqrt{\Delta_{curr}}/e, \epsilon/2)$.

This concludes our proof of Theorem 3. We now give
a brief sketch of another bipartite edge coloring algorithm.
The idea of the algorithm is to repeatedly remove
a randomly generated matching and to make it
a color class. The matching is generated so that edges
incident on "high" degree vertices are removed. This
ensures that at each step the degree goes down by at
least one. A nice property of this algorithm is that if $\Delta$
is "slightly more" than $\Omega(\log n)$ (to be quantified later),
then after $O(\Delta)$ stages the diameter of any remaining
connected component of the graph is O(log n). This allows
us to find an optimal coloring of each component
C by brute-force: elect a leader, which will then obtain
complete information about C and compute an optimal
edge coloring of C. We conjecture that this property
is true of the previous algorithm also, but the analysis
is complicated by the dependency among the edges.
Recall that Luby's algorithm is invoked in Part (II) of
the previous algorithm when the degree of the graph is
$O(\log^{2+\delta} n)$; at this point we can use the new algorithm,
instead of Luby's, without changing the polylogarithmic
complexity of the whole algorithm.

A brief sketch of the second algorithm follows. The
algorithm takes $O(\Delta + \log n)$ time, but uses at most
$(e/(e-1) + \epsilon)\Delta + 0.4 \log^{1+2\delta} n$ colors w.h.p., if $\Delta \ge
\log^{1+2\delta} n$ ($\delta > 0$). Denote $\Delta - (\frac{e-1}{e})(1-\epsilon)\log^{1+\delta} n$ by
$f(\Delta, \epsilon, \delta)$. Given an upper bound $\Delta$ on the maximum
degree of the graph and $\epsilon, \delta > 0$, a $\Delta$-phase is as follows:

Repeat $\log^{1+\delta} n$ times: Every vertex $v \in B$
with degree at least $f(\Delta, \epsilon, \delta)$ picks a random
edge in $\delta(v)$ as a candidate. Now, every vertex
$u \in A$ picks one of the candidate edges incident
to it (if any) at random as a winner. The same
fresh color is given to the winners, which are
now deleted from the graph.

The edge coloring algorithm:

(1) Repeat until $\Delta < \log^{1+2\delta} n$:

• Execute a $\Delta$-phase.

• Replace $\Delta$ by $f(\Delta, \epsilon, \delta)$.

(2) Use Luby's algorithm to color the remaining graph.
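A $\Delta$-phase can be simulated centrally as in the sketch below. It is an illustration under stated assumptions: edges are pairs (u, v) with u in A and v in B, the degree threshold $f(\Delta, \epsilon, \delta)$ and the number of rounds are passed in as plain numbers, colors are numbered by round, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def phase_round(edges, min_degree, rng):
    """One round: every v in B of degree >= min_degree proposes a random
    incident edge; every u in A accepts one proposal at random."""
    edges_at_b = defaultdict(list)
    for u, v in edges:
        edges_at_b[v].append((u, v))
    proposals = defaultdict(list)
    for incident in edges_at_b.values():
        if len(incident) >= min_degree:
            candidate = rng.choice(incident)
            proposals[candidate[0]].append(candidate)
    return [rng.choice(candidates) for candidates in proposals.values()]


def delta_phase(edges, min_degree, rounds, rng):
    """Run the given number of rounds; winners of round i get color i and
    are deleted. Returns (coloring, remaining_edges)."""
    remaining = list(edges)
    coloring = {}
    for color in range(rounds):
        winners = set(phase_round(remaining, min_degree, rng))
        for edge in winners:
            coloring[edge] = color
        remaining = [edge for edge in remaining if edge not in winners]
    return coloring, remaining
```

Each round yields a matching: a vertex of B proposes at most one edge and a vertex of A accepts at most one, so giving all winners of a round the same color is safe.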

The key claim is that if $\Delta \ge \log^{1+2\delta} n$ and is an upper
bound on the maximum degree of the graph, then
w.h.p., the maximum degree of the graph after executing
a $\Delta$-phase is at most $f(\Delta, \epsilon, \delta)$. Also, if we start
with a graph G where $\Delta(G) \ge \log^{1+2\delta+\delta'} n$ with $\delta' > 0$,
then the diameter of any connected component of the
graph remaining after Step (1) above is at most O(log n)
w.h.p., and hence the brute-force approach can be used
in Step (2) above, instead of Luby's algorithm.

4 Edge Coloring General Graphs

In this section, we give a brief sketch of a Monte Carlo
distributed algorithm for edge coloring general graphs.
Several technical details have been omitted for conciseness,
and will appear in the final version of this paper.
The algorithm is a recursive procedure based on
an idea of Karloff & Shmoys [10], and uses our bipartite
edge coloring algorithm as a subroutine. It
can be applied on any graph with maximum degree
$\Delta \ge \log^{2} n$. It runs in O(log n) time, and uses at most
$(e/(e-1) + \epsilon)\Delta + (2 - (e/(e-1) - \epsilon))\log^{2+\delta} n$ colors
w.h.p., for any given $\epsilon, \delta > 0$. Next, we state without
proof that if $\Delta \ge \log^{2+\delta'} n$ for some $\delta' > 0$, then the
algorithm can be made to use at most $(e/(e-1) + \epsilon)\Delta$
colors w.h.p. (the failure probability will depend on $\delta'$,
but will be asymptotically less than any inverse polynomial
in n, for any fixed $\delta' > 0$).

The idea of the algorithm, as in [10], is to first compute
a random partition of the vertices of G = (V, E)
into black and white vertices, with each vertex deciding
to go to one of the two subsets independently by the toss
of a coin. Next, we color the bipartite graph induced
by the edges with endpoints of different colors using our
algorithm (note that every vertex knows which side of
the bipartite graph it is in), and then recurse on the
two disconnected induced subgraphs (the one induced
by black vertices and the one induced by white vertices)
using the same sample of fresh new colors on both. The
key idea is that the maximum degree of each of these
three subgraphs is at most $\Delta/2 + \Delta^{1/2+\epsilon}$ w.h.p., for any
$\epsilon > 0$.

Assuming that this "degree-halving" is true, the
maximum degree becomes $\log^{2+\delta} n$ in $O(\log \Delta)$ iterations
at which point, we can apply Luby's algorithm
to the resulting graphs $G_r$ to get $(2\Delta(G_r) - 1)$ edge
colorings of these graphs.
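The partition step of the recursion can be written down directly. The sketch below is a centralized illustration with hypothetical names: each vertex tosses a fair coin, and the edge set splits into the bipartite part (endpoints of different colors) and the two monochromatic parts on which the procedure recurses.

```python
import random


def split_by_coin(vertices, edges, rng):
    """Randomly color vertices black (True) or white (False) and split edges.

    Returns (side, bipartite, black, white): the coin tosses, the edges with
    endpoints of different colors, and the edges inside each color class."""
    side = {v: rng.random() < 0.5 for v in vertices}
    bipartite = [(u, v) for u, v in edges if side[u] != side[v]]
    black = [(u, v) for u, v in edges if side[u] and side[v]]
    white = [(u, v) for u, v in edges if not side[u] and not side[v]]
    return side, bipartite, black, white
```

Every edge lands in exactly one of the three parts, and the black and white parts share no vertex, which is why the same palette can be reused on both.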

In fact, we can do better, if the initial degree $\Delta \ge
\log^{2+\delta'} n$ for some $\delta' > 0$. A consequence of the Karloff
& Shmoys partition idea is that the diameters of the induced
subgraphs of the white and black vertices shrink
at an exponential rate w.h.p., and hence after $O(\log \Delta)$
iterations, it becomes O(log n) w.h.p. (proofs omitted).
This allows us to apply a brute-force search and color
the finally remaining graphs $G_r$ optimally (with $\Delta(G_r)$
or $\Delta(G_r) + 1$ colors, instead of $2\Delta(G_r) - 1$ colors).
Hence, the number of colors used in this case is at most
$(e/(e-1) + \epsilon)\Delta$, w.h.p. Though this is not a big reduction
in the number of colors used, we think that this
"diameter shrinking" is an important property of the
Karloff & Shmoys algorithm, and should be useful in
other contexts.

Abstract

The correctness of most randomized distributed algorithms
is expressed by a statement of the form "some
predicate of the executions holds with high probability,
regardless of the order in which actions are scheduled".
In this paper, we present a general methodology
to prove correctness statements of such randomized
algorithms. Specifically, we show how to prove
such statements by a series of refinements, which terminate
in a statement independent of the schedule.
To demonstrate the subtlety of the issues involved in
this type of analysis, we focus on Rabin's randomized
distributed algorithm for mutual exclusion [6].

Surprisingly, it turns out that the algorithm does
not maintain one of the requirements of the problem
under a certain schedule. In particular, we give a
schedule under which a set of processes can suffer
lockout for arbitrarily long periods.

1  Introduction 

1.1  General  Considerations 

For many distributed system problems, it is possible
to produce randomized algorithms that are better
than their deterministic counterparts: they may
be more efficient, have simpler structure, and even
achieve correctness properties that deterministic algorithms
cannot. One cost of using randomization is
the increased difficulty of proving correctness of the
resulting algorithms. A randomized algorithm typically
involves two different types of nondeterminism
- that arising from the random choices and that arising
from an adversary. The interaction between these
two kinds of nondeterminism complicates the analysis
of the algorithm.

In the distributed system model considered here,
each of a set of concurrent processes executes its local
code and communicates with the others through
a shared variable. The code can contain random
choices, which leads to probabilistic branch points in
the tree of executions. By assumption, the algorithm
is provided at certain points of the execution with
random inputs having known distributions. We can
equivalently consider that all random choices made
in a single execution are given by a parameter $\omega$ at
the onset of the execution. The parameter $\omega$ thus
captures the first type of nondeterminism.

For the second type, we here define the adversary
$\mathcal{A}$ to be the entity controlling the order in which processes
take steps. (In other work (e.g., [4]), the adversary
can control other decisions, such as the contents
of some messages.) An adversary $\mathcal{A}$ bases its choices
on the knowledge it holds about the prior execution
of the system. This knowledge varies according to
the specifications for each given problem. In this paper,
we will consider an adversary allowed to observe
only certain "external" manifestations of the execution
and having no access, for example, to information
about local process states. We will say that an
adversary is admissible to emphasize its specificity.

These two sources of nondeterminism, $\omega$ and $\mathcal{A}$,
uniquely define an execution $\mathcal{E} = \mathcal{E}(\omega, \mathcal{A})$ of the
algorithm.
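The statement that $\omega$ and the adversary together determine the execution can be illustrated with a toy model. Everything in the sketch below is a hypothetical simplification and not the model used for Rabin's algorithm: $\omega$ is a table of pre-drawn bits per process, the adversary is a function that sees only which processes have taken steps so far, and a step consumes the next bit of the scheduled process.

```python
from typing import Callable, List, Sequence, Tuple

# The adversary sees only the externally visible schedule so far (the list of
# process ids that have moved) and returns the id of the next process to move.
Adversary = Callable[[Sequence[int]], int]


def run(omega: Sequence[Sequence[int]], adversary: Adversary,
        steps: int) -> List[Tuple[int, int]]:
    """Return the execution determined by omega and the adversary.

    omega[p] lists the random bits process p will use, in order; the result
    is the sequence of (process, bit) pairs actually taken."""
    used = [0] * len(omega)      # how many bits each process has consumed
    schedule: List[int] = []     # what the adversary is allowed to observe
    execution: List[Tuple[int, int]] = []
    for _ in range(steps):
        process = adversary(schedule)
        bit = omega[process][used[process]]
        used[process] += 1
        schedule.append(process)
        execution.append((process, bit))
    return execution
```

Once $\omega$ and the adversary are fixed, nothing random remains, so running the function twice on the same pair gives the same execution. A round-robin adversary on two processes is `lambda schedule: len(schedule) % 2`.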

Among the correctness properties one often wishes
to prove for randomized algorithms are properties
that state that a certain property W of executions
has a "high" probability of holding against all admissible
adversaries. Note that the probability mentioned
in this statement is taken with respect to a
probability distribution on executions. One of the
major sources of complication is that there are two
probability spaces that need to be considered: the
space of random inputs $\omega$ and the space of random
executions. Let dP denote the probability measure
given for the space of random inputs $\omega$.

Since the evolution of the system is determined
both by the (random) choices expressed by $\omega$ and
by the adversary $A$, we do not have a single probability
distribution on the space of all executions. Rather,
for each adversary $A$ there is a corresponding
distribution $dP_A$ on the executions “compatible with” $A$.
High-probability correctness properties of a randomized
algorithm $C$ are then generally stated in terms
of the distributions $dP_A$, in the following form. Let
$W$ and $I$ be sets of executions of $C$ and let $l$ be a
real number in $[0, 1]$. Then $C$ is correct provided that
$P_A[W \mid I] \ge l$ for every admissible adversary $A$. For
a condition expressed in this form, we think of $W$ as
the set of “good” (or “winning”) executions, while $I$
is a set that expresses the assumptions under which
the good behavior is supposed to hold.

In general, it is difficult to calculate (good bounds
on) probabilities of the form $P_A[W \mid I]$. This is
because the probability that the execution is in $W$, $I$,
or $W \cap I$ depends on a combination of the choices
in $\omega$ and those made by the adversary. Although we
assume a basic probability distribution $dP$ for $\omega$, the
adversary’s choices are determined in a more complicated
way, in terms of certain kinds of knowledge
of the prior execution. In particular, the adversary’s
choices can depend on the outcomes of prior random
choices made by the processes.

The situation is much simpler in the special case
where the events $W$ and $I$ are defined directly in
terms of the choices in $\omega$. In this case, the desired
probability can be calculated just by using the assumed
probability distribution $dP$.

Our general methodology for proving a high-probability
correctness property of the form $P_A[W \mid I]$
consists of proving successive lower bounds:

$$P_A[W \mid I] \ge P_A[W_1 \mid I_1] \ge \cdots \ge P_A[W_r \mid I_r],$$

where all the $W_i$ and $I_i$ are sets of executions, and
where the last two sets, $W_r$ and $I_r$, are defined directly
in terms of the choices in $\omega$. The final term,
$P_A[W_r \mid I_r]$, is then evaluated (or bounded from below)
using the distribution $dP$. This methodology can
be difficult to implement as it involves disentangling
the ways in which the random choices made by the
processes affect the choices made by the adversary.


This paper is devoted to emphasizing the need for
such a rigorous methodology in correctness proofs:
in the context of randomized algorithms, the power
of the adversary is generally hard to analyze, and
imprecise arguments can easily lead to incorrect
statements.

As evidence supporting our point, we give an analysis
of Rabin’s randomized distributed algorithm [6]
implementing mutual exclusion for n processes using
a read-modify-write primitive on a shared variable
with O(log n) values. Rabin claimed that the algorithm
satisfies the following correctness property:
for every adversary, any process competing for entrance
to the critical section succeeds with probability
$\Omega(1/m)$, where m is the number of competing processes.
As we shall see, this property can be expressed
in the general form $P_A[W \mid I] \ge l$. In [5], Sharir et
al. gave another analysis of the algorithm, providing
a formal model in terms of Markov chains; however,
they did not make explicit the influence of the adversary
on the probability distribution on executions.

We  show  that  this  influence  is  crucial:  the  adver¬ 
sary  in  [6]  is  much  stronger  than  previously  thought, 
and  in  fact,  the  high  probability  correctness  result 
claimed  in  [6]  does  not  hold. 


The problem of mutual exclusion [2] involves allocating
an indivisible, reusable resource among n competing
processes. A mutual exclusion algorithm is
said to guarantee progress¹ if it continues to allocate
the resource as long as at least one process is
requesting  it.  It  guarantees  no-lockout  if  every  pro¬ 
cess  that  requests  the  resource  eventually  receives  it. 
A  mutual  exclusion  algorithm  satisfies  bounded  wait¬ 
ing  if  there  is  a  fixed  upper  bound  on  the  number  of 
times  any  competing  process  can  be  bypassed  by  any 
other process. In conjunction with the progress property,
the bounded waiting property implies the no-lockout
property. In 1982, Burns et al. [1] considered
the mutual exclusion problem in a distributed setting
where processes communicate through a shared
read-modify-write  variable.  For  this  setting,  they 
proved  that  any  deterministic  mutual  exclusion  algo¬ 
rithm  that  guarantees  progress  and  bounded  waiting 
requires  that  the  shared  variable  take  on  at  least  n 
distinct  values.  Shortly  thereafter,  Rabin  published 
a  randomized  mutual  exclusion  algorithm  [6]  for  the 
same  shared  memory  distributed  setting.  His  algo¬ 
rithm  guarantees  progress  using  a  shared  variable 
that  takes  on  only  O(log  n)  values. 

1.2 Rabin’s Algorithm

It is quite easy to verify that Rabin’s algorithm
guarantees mutual exclusion and progress; in addition,
however, Rabin claimed that his algorithm satisfies
the following informally stated strong no-lockout
property².

¹ We give more formal definitions of these properties in
Section 2.

“If process i participates in a trying round
of a run of a computation by the protocol
and compatible with the adversary, together
with $0 \le m - 1 < n$ other processes, then the
probability that i enters the critical region at
the end of that round is at least c/m, $c \approx 2/3$.” (*)

This  property  says  that  the  algorithm  guarantees 
an  approximately  equal  chance  of  success  to  all  pro¬ 
cesses  that  compete  at  the  given  round.  Rabin  argued 
in  [6]  that  a  good  randomized  mutual  exclusion  algo¬ 
rithm  should  satisfy  this  strong  no-lockout  property, 
and  in  particular,  that  the  probability  of  each  process 
succeeding  should  depend  inversely  on  m,  the  num¬ 
ber  of  actual  competitors  at  the  given  round.  This 
dependence  on  m  was  claimed  to  be  an  important  ad¬ 
vantage  of  this  algorithm  over  another  algorithm  de¬ 
veloped  by  Ben-Or  (also  described  in  [6]);  Ben-Or’s 
algorithm  is  claimed  to  satisfy  a  weaker  no-lockout 
property  in  which  the  probability  of  success  is  approx¬ 
imately  c/n,  where  n  is  the  total  number  of  processes, 
i.e.,  the  number  of  potential  competitors. 

Rabin’s  algorithm  uses  a  randomly-chosen  round 
number  to  conduct  a  competition  for  each  round. 
Within  each  round,  competing  processes  choose  lot¬ 
tery  numbers  randomly,  according  to  a  truncated  ge¬ 
ometric  distribution.  One  of  the  processes  drawing 
the  largest  lottery  number  for  the  round  wins.  Thus, 
randomness is used in two ways in this algorithm:
for choosing the round numbers and for choosing the
lottery numbers. The detailed code for this algorithm
appears  in  Figure  1. 

We  begin  our  analysis  by  presenting  three  differ¬ 
ent  formal  versions  of  the  no-lockout  property.  These 
three  statements  are  of  the  form  discussed  in  the  in¬ 
troduction  and  give  lower  bounds  on  the  (conditional) 
probability  that  a  participating  process  wins  the  cur¬ 
rent  round  of  competition.  They  differ  by  the  nature 
of  the  events  involved  in  the  conditioning  and  by  the 
values  of  the  lower  bounds. 

Described in this formal style, the strong no-lockout
property claimed by Rabin involves conditioning
over m, the number of participating processes
in the round. We show in Theorem 3.1 that the
adversary can use this fact in a simple way to lock out
any process during any round.

² In the statement of this property, a “trying round” refers to
the interval between two successive allocations of the resource,
and the “critical region” refers to the interval during which a
particular process has the resource allocated to it. A “critical
region” is also called a “critical section”.

On  the  other  hand,  the  weak  c/n  no-lockout  prop¬ 
erty  that  was  claimed  for  Ben-Or’s  algorithm  involves 
only conditioning over events that describe the knowledge
of the adversary at the end of the previous round.
We  show  in  Theorems  3.2  and  3.4  that  the  algorithm 
suffers  from  a  different  flaw  which  bars  it  from  satis¬ 
fying  even  this  property. 

We  discuss  here  informally  the  meaning  of  this  re¬ 
sult.  The  idea  in  the  design  of  the  algorithm  was  to 
incorporate  a  mathematical  procedure  within  a  dis¬ 
tributed  context.  This  procedure  allows  one  to  se¬ 
lect  with  high  probability  a  unique  random  element 
from  any  set  of  at  most  n  elements.  It  does  so  in 
an  efficient  way  using  a  distribution  of  small  support 
(“small” means here O(log n)) and is very similar
to  the  approximate  counting  procedure  of  [3].  The 
mutual  exclusion  problem  in  a  distributed  system  is 
also  about  selecting  a  unique  element;  specifically  the 
problem  is  to  select  in  each  trying  round  a  unique 
process  among  a  set  of  competing  processes.  In  order 
to  use  the  mathematical  procedure  for  this  end  and 
select  a  true  random  participating  process  at  each 
round  and  for  all  choices  of  the  adversary,  it  is  neces¬ 
sary  to  discard  the  old  values  left  in  the  local  variables 
by  previous  calls  of  the  procedure.  (If  not,  the  adver¬ 
sary  could  take  advantage  of  the  existing  values.)  For 
this,  another  use  of  randomness  was  designed  so  that, 
with  high  probability,  at  each  new  round,  all  the  par¬ 
ticipating  processes  would  erase  their  old  values  when 
taking  a  step. 

Our  results  demonstrate  that  this  use  of  random¬ 
ness  did  not  actually  fulfill  its  purpose  and  that  the 
adversary  is  able  in  some  instances  to  use  old  lottery 
values  and  defeat  the  algorithm. 

In  Theorem  3.5  we  show  that  the  two  flaws  re¬ 
vealed  by  our  Theorems  3.1  and  3.2  are  at  the  center 
of the problem: if one restricts attention to executions
where program variables are reset, and if we
forbid the adversary from using the strategy revealed by
Theorem 3.1, then the strong bound does hold. Our
proof  highlights  the  general  difficulties  encountered 
in  our  methodology  when  attempting  to  disentangle 
the  probabilities  from  the  influence  of  A. 

The  algorithm  of  Ben-Or  which  is  presented  at  the 
end  of  [6]  is  a  modification  of  Rabin’s  algorithm  that 
uses  a  shared  variable  of  constant  size.  All  the  meth¬ 
ods  that  we  develop  in  the  analysis  of  Rabin’s  al¬ 
gorithm  apply  to  this  algorithm  and  establish  that 
Ben-Or’s algorithm is similarly flawed and does not
satisfy the 1/2en no-lockout property claimed for it
in [6]. Actually, in this setting, the shared variables
can take only two values, which allows the adversary
to lock out processes with probability one, as we show
in Theorem 3.8.

In  a  recent  paper  [7],  Kushilevitz  and  Rabin  use  our 
results  to  produce  a  modification  of  the  algorithm, 
solving  randomized  mutual  exclusion  with  log2^n  val¬ 
ues.  They  solve  the  problem  revealed  by  our  Theo¬ 
rem  3.1  by  conducting  before  round  k  the  competition 
that  results  in  the  control  of  Crit  by  the  end  of  round 
k.  And  they  solve  the  problem  revealed  by  our  The¬ 
orem  3.2  by  enforcing  in  the  code  that  the  program 
variables  are  reset  to  0. 

The  remainder  of  this  paper  is  organized  as  follows. 
Section  2  contains  a  description  of  the  mutual  exclu¬ 
sion  problem  and  formal  definitions  of  the  strong  and 
weak  no-lockout  properties.  Section  3  contains  our 
results  about  the  no-lockout  properties  for  Rabin’s 
algorithm.  It  contains  Theorems  3.1  and  3.2  which 
disprove in different ways the strong and weak no-lockout
properties, and Theorem 3.5, whose proof
is a model for our methodology: a careful analysis of
this  proof  reveals  exactly  the  origin  of  the  flaws  stated 
in  the  two  previous  theorems.  One  of  the  uses  of  ran¬ 
domness  in  the  algorithm  was  to  disallow  the  adver¬ 
sary  from  knowing  the  value  of  the  program  variables. 
Our  Theorems  3.2  and  3.7  express  that  this  objective 
is  not  reached  and  that  the  adversary  is  able  to  in¬ 
fer  (partially)  the  value  of  all  the  fields  of  the  shared 
variable. Theorem 3.8 deals with the simpler setting
of Ben-Or’s algorithm.

Some  mathematical  properties  needed  for  the  con¬ 
structions  of  Section  3  are  presented  in  an  appendix 
(Section  4). 


2 The Mutual Exclusion Problem

The  problem  of  mutual  exclusion  is  that  of  continu¬ 
ally  arbitrating  the  exclusive  ownership  of  a  resource 
among  a  set  of  competing  processes.  The  set  of  com¬ 
peting  processes  is  taken  from  a  universe  of  size  n  and 
changes  with  time.  A  solution  to  this  problem  is  a 
distributed  algorithm  described  by  a  program  (code) 
C  having  the  following  properties.  All  involved  pro¬ 
cesses  run  the  same  program  C.  C  is  partitioned  into 
four  regions,  Try,  Crit,  Exit,  and  Rem  which  are 
run  cyclically  in  this  order  by  all  processes  executing 
C.  A  process  in  Crit  is  said  to  hold  the  resource.  The 
indivisible  property  of  the  resource  means  that  at  any 
point  of  an  execution,  at  most  one  process  should  be 
in  Crit. 


2.1  Definition  of  Runs,  Rounds,  and 
Adversaries 

In this subsection, we define the notions of run, round,
adversary,  and  fair  adversary  which  we  will  use  to 
define  the  properties  of  progress  and  no-lockout. 

A run $\rho$ of a (partial) execution $\mathcal{E}$ is a sequence
of triplets $\{(p_1, old_1, new_1), (p_2, old_2, new_2),
\ldots, (p_t, old_t, new_t), \ldots\}$ indicating that process $p_t$
takes the $t$-th step in $\mathcal{E}$ and undergoes the region
change $old_t \to new_t$ during this step (e.g., $old_t =
new_t =$ Try, or $old_t =$ Try and $new_t =$ Crit). We
say that $\mathcal{E}$ is compatible with $\rho$.

An admissible adversary for the mutual exclusion
problem is a mapping $A$ from the set of finite runs
to the set $\{1, \ldots, n\}$ that determines which process
takes its next step as a function of the current partial
run. That is, the adversary is only allowed to
see the changes of regions. A run
$\rho = \{(p_1, old_1, new_1), (p_2, old_2, new_2), \ldots\}$
and an adversary $A$ are compatible if, for every $t$,
$A[\{(p_1, old_1, new_1), \ldots, (p_t, old_t, new_t)\}] = p_{t+1}$.
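The definition of an admissible adversary can be made concrete with a small sketch. The names `Step`, `Adversary`, `round_robin` and `is_compatible` below are illustrative choices, not notation from the paper; the round-robin scheduler is just one simple example of a mapping from finite runs to process identifiers.

```python
from typing import Callable, Sequence, Tuple

# A step of a run: (process id, old region, new region).
Step = Tuple[int, str, str]
# An adversary sees only the partial run and names the next process.
Adversary = Callable[[Sequence[Step]], int]


def round_robin(n: int) -> Adversary:
    """Hypothetical example adversary: schedules processes 1..n cyclically.

    It uses only the length of the partial run, which is certainly part of
    the information an admissible adversary is allowed to see.
    """
    def choose(partial_run: Sequence[Step]) -> int:
        return len(partial_run) % n + 1
    return choose


def is_compatible(run: Sequence[Step], adversary: Adversary) -> bool:
    """Check that A[(step_1, ..., step_t)] = p_{t+1} for every prefix."""
    return all(adversary(run[:t]) == run[t][0] for t in range(len(run)))


example_run = [
    (1, "Rem", "Try"),
    (2, "Rem", "Try"),
    (1, "Try", "Try"),
    (2, "Try", "Crit"),
]
print(is_compatible(example_run, round_robin(2)))
```

Because the adversary receives only the triplets of the run, it cannot inspect local variables directly; anything it learns about them must be inferred from the region changes.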

An  adversary  A  is  fair  if  for  every  execution,  every 
process  i  in  Try,  Crit,  or  Exit  is  eventually  provided 
by  A  with  a  step.  This  condition  describes  “normal” 
executions  of  the  algorithm  and  says  that  processes 
can  quit  the  competition  only  in  Rem. 

A round of an execution is the part between two
successive entrances to the critical section (or before
the first entrance). Formally, it is a maximal execution
fragment of the given execution, containing one
transition Try → Crit at the end of this fragment
and no other transition Try → Crit. The round of a
run is defined similarly.

A process i participates in a round if i takes a step
while in its trying section Try.
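As a minimal sketch of these two definitions, the following code cuts a run into rounds and computes the participants of each round. It assumes a run is a list of (process, old region, new region) triplets and reads "takes a step while in Try" as "the old region of the step is Try"; both the representation and the function names are illustrative.

```python
from typing import List, Sequence, Set, Tuple

Step = Tuple[int, str, str]  # (process id, old region, new region)


def split_into_rounds(run: Sequence[Step]) -> List[List[Step]]:
    """Cut a run after each Try -> Crit transition.

    A trailing fragment without a Try -> Crit transition is an unfinished
    round and is returned as the last element.
    """
    rounds: List[List[Step]] = []
    current: List[Step] = []
    for step in run:
        current.append(step)
        _, old, new = step
        if old == "Try" and new == "Crit":
            rounds.append(current)
            current = []
    if current:
        rounds.append(current)
    return rounds


def participants(round_steps: Sequence[Step]) -> Set[int]:
    """Processes taking at least one step from region Try in the round."""
    return {p for (p, old, _new) in round_steps if old == "Try"}


example_run = [
    (1, "Try", "Try"), (2, "Try", "Try"), (1, "Try", "Crit"),
    (1, "Crit", "Exit"), (3, "Try", "Try"), (1, "Exit", "Rem"),
    (3, "Try", "Crit"),
]
for number, steps in enumerate(split_into_rounds(example_run), start=1):
    print(number, sorted(participants(steps)))
```

Note that the set of participants of a round is only known once the round has ended, since a process may join at any step before the closing Try → Crit transition.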

2.2  The  Progress  and  No-Lockout 
Properties 

Definition 2.1 An algorithm C that solves mutual exclusion
guarantees progress if, for all fair adversaries,
there is no infinite execution in which, from some point
on, at least one process is in its Try region (respectively
its Exit region) and no transition Try → Crit
(respectively Exit → Rem) occurs.

The  properties  that  we  considered  thus  far  are  non- 
probabilistic.  The  no-lockout  property  is  probabilis¬ 
tic.  Its  formal  definition  requires  the  following  nota¬ 
tion: 

Let  X  denote  any  generic  quantity  whose  value 
changes  as  the  execution  unfolds  (e.g.,  a  program 
variable). We let X(k) denote the value of X just
prior to the last step (Try → Crit) of the kth round
of  the  execution.  As  a  special  case  of  this  general 
notation, we define the following.

$\mathcal{P}(k)$ is the set of participating processes in round
k. (Set $\mathcal{P}(k) = \emptyset$ if $\mathcal{E}$ has fewer than k rounds.) The
notation $\mathcal{P}(k)$ is consistent with the general notation
because the set of processes participating in round k is
updated as round k progresses: in effect the definition
of this set is complete only at the end of round k (this
fact is at the heart of our Theorem 3.1).

t(k) is the total number of steps that are taken by
all the processes up to the end of round k.

$\mathcal{N}(k)$ is the set of executions in which all the processes
j participating in round k reinitialize their program
variables $B_j$ with a new value $\beta_j(k)$ during
round k. ($\mathcal{N}$ stands for New-values.) $\{\beta_j(k);\ k =
1, 2, \ldots,\ j = 1, \ldots, n\}$ is a family of iid³ random
variables whose distribution is geometric truncated at
$\log_2 n + 4$ (see [6]).

For each i and k, we let $W_i(k)$ denote the set of
executions in which process i enters the critical region
at the end of round k.

We consistently use the probability theory convention
according to which, for any property S, the set
of executions $\{\mathcal{E} : \mathcal{E} \text{ has property } S\}$ is denoted as
$\{S\}$. Then:

• For each step number t and each execution $\mathcal{E}$
we let $\pi_t(\mathcal{E})$ denote the run compatible with the
first t steps of $\mathcal{E}$. For any t-step run $\rho$, $\{\pi_t = \rho\}$
represents the set of executions compatible with
$\rho$. ($\{\pi_t = \rho\} = \emptyset$ if $\rho$ has fewer than t steps.) We
will use $\pi_k$ in place of $\pi_{t(k)}$ to simplify notation.

• Similarly, for all $m \le n$, $\{|\mathcal{P}(k)| = m\}$ represents
the set of executions having m processes
participating in round k.

The quantities $\mathcal{N}(k)$, $\{\pi_t = \rho\}$, $W_i(k)$, and
$\{|\mathcal{P}(k)| = m\}$ are sets of executions: for a given adversary they
are random events in the probability space of random
executions endowed with the measure $dP_A$.

We  now  present  the  various  no-lockout  properties 
that  we  want  to  study.  A  first  question  is  to  char¬ 
acterize  relevant  events  I  over  which  conditioning 
should  be  done.  Note  first  that  restricting  the  set 
of  executions  to  the  ones  having  a  certain  property 
amounts  to  conditioning  on  this  property.  In  par¬ 
ticular,  we  will  condition  on  the  fact  that  process  i 
participates in round k. A crucial remark is that, in
the worst-case adversary framework that we are interested
in, the adversary minimizing $P_A[W_i(k) \mid I]$
will make its choices as if “knowing” $I$. We will derive
telling consequences from this fact.

We actually have in mind to compute the probability
of $W_i(k)$ at different points $s_k$ of the execution.
One way to go would be to condition on the past
execution. But, by our previous remark, this is tantamount
to allowing the adversary this knowledge. It
is then easy to see that lockout is possible. Another
natural alternative, which we will adopt, is to compute
the probability at point $s_k$ “from the point of view
of the adversary”: this translates formally into conditioning
over the value of the run up to point $s_k$.
We will say that such a no-lockout property is
run-knowing.

³ Recall that iid stands for “independent and identically
distributed”.

The first two definitions involve evaluating the
probabilities “at the beginning of round k”.

Definition 2.2 (Weak, Run-knowing, Probabilistic
no-lockout) A solution to the mutual exclusion
problem satisfies weak, run-knowing probabilistic
no-lockout whenever there exists a constant c such that,
for every fair adversary A, every $k \ge 1$, every $(k-1)$-round
run $\rho$ compatible with A, and every process i,

$$P_A[W_i(k) \mid \pi_{k-1} = \rho,\ i \in \mathcal{P}(k)] \ge c/n,$$

whenever $P_A[\pi_{k-1} = \rho,\ i \in \mathcal{P}(k)] \ne 0$.

The  next  property  formally  expresses  statement  (*) 
of  Rabin.  As  we  mentioned  in  our  general  presenta¬ 
tion,  considering  rounds  having  m  participating  pro¬ 
cesses  corresponds  to  conditioning  on  this  fact. 

Definition 2.3 (Strong, Run-knowing, Probabilistic
no-lockout) The same as in Definition 2.2
except that:

$$P_A[W_i(k) \mid \pi_{k-1} = \rho,\ i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] \ge c/m,$$

whenever $P_A[\pi_{k-1} = \rho,\ i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] \ne 0$.

Recalling  the  two  interpretations  of  conditioning  in 
terms  of  time  and  knowledge  held  by  the  adversary, 
we  see  that  this  property  differs  fundamentally  from 
the  previous  one  because,  here,  the  adversary  is  pro¬ 
vided  with  the  number  of  processes  due  to  participate 
in  the  future  round  (i.e.,  after  t(k  —  1)).  By  integra¬ 
tion  over  m,  we  see  that  an  algorithm  satisfying  the 
strong  property  also  satisfies  the  weak  property. 
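The "integration over m" step can be spelled out as follows. This is a sketch of the standard total-probability argument, written with $P_A$, $W_i(k)$, $\pi_{k-1}$ and $\mathcal{P}(k)$ for the measure, winning event, run and participant set defined above; the sums range over those m for which the conditioning event has nonzero probability.

```latex
\begin{align*}
P_A[W_i(k) \mid \pi_{k-1} = \rho,\ i \in \mathcal{P}(k)]
 &= \sum_{m} P_A[W_i(k) \mid \pi_{k-1} = \rho,\ i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] \\
 &\qquad\quad \cdot\, P_A[\,|\mathcal{P}(k)| = m \mid \pi_{k-1} = \rho,\ i \in \mathcal{P}(k)] \\
 &\ge \sum_{m} \frac{c}{m}\; P_A[\,|\mathcal{P}(k)| = m \mid \pi_{k-1} = \rho,\ i \in \mathcal{P}(k)] \\
 &\ge \frac{c}{n} \sum_{m} P_A[\,|\mathcal{P}(k)| = m \mid \pi_{k-1} = \rho,\ i \in \mathcal{P}(k)]
  \;=\; \frac{c}{n}.
\end{align*}
```

The second inequality uses only $m \le n$, so the weak bound $c/n$ holds with the same constant c.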

The next definition is the transcription of the previous
one for the case where the probability is “computed
at the beginning of the execution” (i.e., $s_k = 0$
for all k).

Definition 2.4 (Strong, Without knowledge,
Probabilistic no-lockout) The same as in Definition
2.2 except that:

$$P_A[W_i(k) \mid i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] \ge c/m,$$

whenever $P_A[i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] \ne 0$.

By integration over $\rho$ we see that the property of
Definition 2.3 is stronger than the property of
Definition 2.4. Equivalently,
an adversary able to falsify Property 2.4 is stronger
than one able to falsify Property 2.3.

3 Our Results

Here,  we  give  a  little  more  detail  about  the  operation 
of  Rabin’s  algorithm  than  we  gave  earlier  in  the  in¬ 
troduction.  At  each  round  k  a  new  round  number  R 
is  selected  at  random  (uniformly  among  100  values). 
The algorithm ensures that any process i that has already
participated in the current round has $R_i = R$,
and so passes a test that verifies this. The variable R
acts as an “eraser” of the past: with high probability,
a newly participating process does not pass this test
and consequently chooses a new random number for
its lottery value $B_i$. The distribution used for this
purpose is a geometric distribution that is truncated
at $b = \log_2 n + 4$: $P[\beta_i(k) = l] = 2^{-l}$ for $l \le b - 1$. The
first  process  that  checks  that  its  lottery  value  is  the 
highest  obtained  so  far  in  the  round,  at  a  point  when 
the  critical  section  is  unoccupied,  takes  possession  of 
the  critical  section.  At  this  point  the  shared  variable 
is  reinitialized  and  a  new  round  begins. 
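The truncated geometric distribution can be sketched in a few lines. The text gives the probabilities $2^{-l}$ only for values below the truncation point b; the code below assumes that the leftover mass $2^{-(b-1)}$ sits on b itself and that b is computed as $\lceil \log_2 n \rceil + 4$. Function names are illustrative.

```python
import math
import random


def truncation_point(n: int) -> int:
    """b = ceil(log2 n) + 4, the largest possible lottery value."""
    return math.ceil(math.log2(n)) + 4


def lottery_pmf(n: int) -> dict:
    """P[value = l] = 2^-l for 1 <= l <= b-1; the rest is placed on b."""
    b = truncation_point(n)
    pmf = {l: 2.0 ** -l for l in range(1, b)}
    pmf[b] = 2.0 ** -(b - 1)  # assumption: leftover mass goes to b
    return pmf


def draw_lottery(n: int, rng: random.Random) -> int:
    """Sample by flipping fair coins until the first failure or until b."""
    b = truncation_point(n)
    value = 1
    while value < b and rng.random() < 0.5:
        value += 1
    return value


rng = random.Random(0)
print(truncation_point(1024), [draw_lottery(1024, rng) for _ in range(5)])
```

With n competitors drawing fresh values, the largest draw is typically close to $\log_2 n$, which is why a shared field with only $\log_2 n + 4$ values suffices to record the current maximum.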

The algorithm has the following two features.
First, any participating process i reinitializes its variable
$B_i$ at most once per round. Second, the process
winning the competition takes at most two steps
(and at least one) after the point $f_k$ of the round at
which the critical section becomes free. Equivalently,
a process i that takes two steps after $f_k$ and does not
win the competition cannot hold the current maximal
lottery value. (A process i having already taken
a step in round k holds the current round number,
i.e., $R_i(k) = R(k)$. On the other hand, the semaphore
S is set to 0 after $f_k$. If i held the highest lottery value
it would pass all three tests in the code and enter the
critical section.) We will take advantage of this last
property in our constructions.

We are now ready to state our results. The first
result states that the strong $\Omega(1/m)$ result claimed
by Rabin is incorrect.

Theorem 3.1 The algorithm does not have the
strong no-lockout property of Definition 2.4 (and
hence of Definition 2.3). Indeed, there is an adversary
A such that, for all rounds k, for all
$m \le n - 1$, $P_A[1 \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] \ne 0$ but

$$P_A[W_1(k) \mid 1 \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] = 0.$$

Proof: As we already remarked, the worst case
adversary acts as if it knows the events on which
conditioning is done. Knowing beforehand that
the total number of participating processes in the
round is m allows the adversary to design a schedule
where processes take steps in turn, where process
1 begins and where process m takes possession
of the critical section. Specifically, the adversary
A does not use its knowledge about $\rho$, gives
one step to process 1 while the critical section is
occupied, waits for Exit and then adopts the schedule
2, 2, 3, 3, ..., n, n, 1. This schedule brings round
k to its end, because of the second property mentioned
above (i.e., all processes are scheduled for two
steps). For this adversary, for $2 \le m \le n - 1$,
$|\mathcal{P}(k)| = m$ happens exactly when process m wins,
so that $P_A[W_1(k) \cap \{|\mathcal{P}(k)| = m\}] = 0$. On the other
hand, for this adversary, process m wins with nonzero
probability, i.e., $P_A[\{1 \in \mathcal{P}(k)\} \cap \{|\mathcal{P}(k)| = m\}] \ne 0$.

■

Shared variable: V = (S, B, R), where:
    S ∈ {0, 1}, initially 0
    B ∈ {0, 1, ..., ⌈log n⌉ + 4}, initially 0
    R ∈ {0, 1, ..., 99}, initially random

Code for i:

Local variables:
    B_i ∈ {0, ..., ⌈log n⌉ + 4}, initially 0
    R_i ∈ {0, 1, ..., 99}, initially ⊥

Code:
    while V ≠ (0, B_i, R_i) do
        if (V.R ≠ R_i) or (V.B < B_i) then
            B_i ← random
            V.B ← max(V.B, B_i)
            R_i ← V.R
        unlock; lock;
    V ← (1, 0, random)
    unlock;
    ** Critical Region **
    lock;
    R_i ← ⊥
    B_i ← 0
    unlock;
    ** Remainder Region **
    lock;

Figure 1: Rabin’s Algorithm
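The schedule used in this proof can be replayed on a small simulation of the Try loop of Figure 1. The sketch below is illustrative only and rests on several assumptions: each loop iteration is one atomic step, all processes start the round with fresh local variables, the exiting process frees the critical section by setting the S field to 0 (a statement the text implies but the figure as reproduced does not show), and the truncated geometric draw puts its leftover mass on the largest value. All function names are hypothetical.

```python
import math
import random


def draw_lottery(n, rng):
    """Truncated geometric draw on 1..b with b = ceil(log2 n) + 4."""
    b = math.ceil(math.log2(n)) + 4
    value = 1
    while value < b and rng.random() < 0.5:
        value += 1
    return value


def try_step(V, B, R, i, n, rng):
    """One atomic Try step of process i; returns True if i enters Crit.

    V is the shared list [S, B, R]; B and R hold the local variables.
    """
    if V == [0, B[i], R[i]]:
        V[:] = [1, 0, rng.randrange(100)]  # take the critical section
        return True
    if V[2] != R[i] or V[1] < B[i]:
        B[i] = draw_lottery(n, rng)
        V[1] = max(V[1], B[i])
        R[i] = V[2]
    return False


def theorem_3_1_round(n, rng):
    """Play one round under the schedule 1 | 2, 2, 3, 3, ..., n, n, 1.

    Returns (winner, number of participants).
    """
    V = [1, 0, rng.randrange(100)]         # critical section occupied
    B = {i: 0 for i in range(1, n + 1)}
    R = {i: None for i in range(1, n + 1)}  # None plays the role of "bottom"
    try_step(V, B, R, 1, n, rng)            # process 1 steps while occupied
    V[0] = 0                                # assumed exit action: S <- 0
    schedule = [i for j in range(2, n + 1) for i in (j, j)] + [1]
    seen = {1}
    for i in schedule:
        seen.add(i)
        if try_step(V, B, R, i, n, rng):
            return i, len(seen)
    raise AssertionError("the schedule should always end the round")


print([theorem_3_1_round(8, random.Random(seed)) for seed in range(5)])
```

In every run of this sketch, process 1 can only win on the very last step of the schedule, at which point all n processes have participated; whenever the round ends with m < n participants, the winner is process m.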

The previous result is not too surprising in the light
of the time interpretation given before Definition 2.2:
restricting the execution to $\{|\mathcal{P}(k)| = m\}$ gives A too
much knowledge about the future. We now give in
Theorem  3.2  the  more  damaging  result,  stating  (1) 
that,  in  spite  of  the  randomization  introduced  in  the 
round  number  variable  R,  the  adversary  is  able  to 
infer  the  values  held  in  the  local  variables  and  (2) 
that  it  is  able  to  use  this  knowledge  to  lock  out  a 
process  with  probability  exponentially  close  to  1. 

Theorem 3.2 There exists a constant $c < 1$, an adversary
A, a round k and a $(k-1)$-round run $\rho$ such
that:

$$P_A[W_1(k) \mid \pi_{k-1} = \rho,\ 1 \in \mathcal{P}(k)] \le e^{-32} + c^n.$$

We  need  the  following  definition  in  the  proof. 

Definition 3.1 Let l be a round. Assume that, during
round l, the adversary adopts the following strategy. It
first waits for the critical section to become free, then
gives one step to process j and then two steps (in any
order) to s other processes. (We will call these
test-processes.) Assume that at this point the critical section
is still available (so that round l is not over). We then
say that process j is an s-survivor (at round l).

The  idea  behind  this  notion  is  that,  by  manufactur¬ 
ing  survivors,  the  adversary  is  able  to  select  processes 
having  high  lottery  values.  We  now  describe  in  more 
detail  the  selection  of  survivors  and  formalize  this  last 
fact. 

In the following we will consider an adversary constructing
sequentially a family of s-survivors for the
five values $s = 2^{\log_2 n + t}$, $t = -1, \ldots, -5$. Whenever
the adversary manages to select a new survivor
it stores it, i.e., does not allocate it any further step
until the selection of survivors is completed. (A actually
allocates steps to selected survivors, but only
very rarely, to comply with fairness. Rarely means
for instance once every $nT^2$ steps, where T is the expected
time to select an n/2-survivor.) By doing so,
A reduces the pool of test-processes still available.
We assume that, at any point in the selection process,
the adversary selects the test-processes at random
among the set of processes still available. (The
adversary could be more sophisticated than random,
but this is not needed.) Note that a new s-survivor
can be constructed with probability one whenever the
available pool has size at least $s + 1$: it suffices to
reiterate the selection process until the selection
completes successfully.

Lemma 3.3 There is a constant d such that for any
$t = -5, \ldots, -1$, for any $2^{\log_2 n + t}$-survivor j, for any
$a = 0, \ldots, 5$,

$$P_A[B_j(l) = \log_2 n + t + a] \ge d.$$

Proof: Let s denote $2^{\log_2 n + t}$. Let j be an s-survivor
and $i_1, i_2, \ldots, i_s$ be the test-processes used in its
selection. Assume also that j drew a new value $B_j(l) =
\beta_j(l)$ (this happens with probability $q_1 = .99$). Remark
that $B_j(l) = \mathrm{Max}\{B_{i_1}(l), \ldots, B_{i_s}(l), B_j(l)\}$; if
this were not the case, one of the test-processes would
have entered Crit. As the test-processes are selected
at random, each of them has with probability .99 a
round number different from $R(l)$ and hence draws a
new lottery number $\beta_i(l)$. Hence, with high probability
$q_2 > 0$, 90% of them do so. The others
keep their old lottery value $B_i(l - 1)$: this value, being
old, has lost in previous rounds and is therefore
stochastically smaller⁴ than a new value $\beta_i(l)$. (An
application of Lemma 4.5 formalizes this.) Hence,
with probability at least $q_1 q_2$ we have the following
stochastic inequality:

$$\mathrm{Max}\{\beta_1(l), \ldots, \beta_{90s/100}(l)\}
\le_{\mathcal{L}} B_j(l) \le_{\mathcal{L}} \mathrm{Max}\{\beta_1(l), \ldots, \beta_{s+1}(l)\}.$$

Corollary 4.4 then shows that, for $a = 0, \ldots, 5$, with
probability at least $q_1 q_2$, $P_A[B_j(l) = \log_2 s + a] \ge q_3$ for
some constant $q_3$ ($q_3$ is close to 0.01). Hence, with
probability at least $d = q_1 q_2 q_3$, $B_j(l)$ is equal to
$\log_2 s + a$. ■

Proof of Theorem 3.2: The adversary uses a preparation phase to select and store some processes having high lottery values. We will, by abuse of language, identify this phase with the round $p$ which corresponds to it. When this preparation phase is over, round $k$ begins.

Preparation phase $p$: For each of the five values $\log_2 n + t$, $t = -5, \ldots, -1$, $A$ selects in the preparation phase many ("many" means $n/20$ for $t = -5, \ldots, -2$ and $6n/20$ for $t = -1$) $2^{\log_2 n + t}$-survivors. Let $S_1$ denote the set of all the survivors thus selected. (Note that $|S_1| = n/2$ so that we have enough processes to conduct this selection.) By partitioning the set of $2^{\log_2 n - 1}$-survivors into six sets of equal size, for each of the ten values $t = -5, \ldots, 4$, $A$ has then secured the existence of $n/20$ processes whose lottery value is $\log_2 n + t$ with probability bigger than $d$. (By Lemma 3.3.)

Round $k$: While the critical section is busy, $A$ gives a step to each of the $n/2$ processes from the set $S_2$ that it did not select in phase $p$. When this is done, with probability at least $1 - e^{-32}$ (see Corollary 4.2) the program variable $B$ holds a value bigger than or equal to $\log_2 n - 5$. The adversary then waits for the critical section to become free and gives steps to the processes of $S_1$ it selected in phase $p$. A process in $S_2$ can win access to the critical section only if the maximum lottery value $B_{S_2} := \mathrm{Max}_{j \in S_2} B_j$ of all the processes in $S_2$ is strictly less than $\log_2 n - 5$ or if no process of $S_1$ holds both the correct round number $R(k)$ and the lottery number $B_{S_2}$. This consideration gives the bound predicted in Theorem 3.2 with $c = (1 - d/100)^{1/20}$. ■

Our proof actually demonstrates that there is an adversary that can lock out, with probability exponentially close to 1, an arbitrary set of $n/2$ processes during some round. With a slight improvement we can derive an adversary that will succeed in locking out (with probability exponentially close to 1) a given set $S_3$ of, for example, $n/100$ processes at all rounds: we just need to remark that the adversary can do without this set $S_3$ during the preparation phase $p$. The adversary would then alternate preparation phases $p_1, p_2, \ldots$ with rounds $k_1, k_2, \ldots$ The set $S_3$ of processes would be given steps only during rounds $k_1, k_2, \ldots$ and would be locked out at each time with probability exponentially close to 1.

* A real random variable $X$ is stochastically smaller than another one $Y$ (we write that: $X \le_{\mathcal L} Y$) exactly when, for all $x \in \mathbb{R}$, $P[X > x] \le P[Y > x]$. Hence, if $X \le Y$ in the usual sense, it is also stochastically smaller.

In view of our counterexample we might think that increasing the size of the shared variable might yield a solution. For instance, if the geometric distribution used by the algorithm is truncated at the value $b = 2\log_2 n$ instead of $\log_2 n + 4$, then the adversary is not able as before to ensure a lower bound on the probability that an $n/2$-survivor holds $b$ as its lottery value. (The probability is given by Theorem 4.1 with $x = \log n$.) Then the argument of the previous proof does not hold anymore. Nevertheless, the next theorem establishes that raising the size of the shared variable does not help as long as the size stays sub-linear. But this is exactly the theoretical result the algorithm was supposed to achieve. (Recall the $n$-lower bound of [1] in the deterministic case.) Furthermore, the remark made above applies here also: a set of processes of linear size can be locked out at each time with probability arbitrarily close to 1.

Theorem 3.4 Suppose that we modify the algorithm so that the set of possible round numbers used has size $r$ and that the set of possible lottery numbers has size $b$ ($\log_2 n + 4 < b < n$). Then there exist positive constants $c_1$ and $c_2$, an adversary $A$, and a run $\rho$ such that


P.«[Wi(Jt)|xfc_i  =  p,  ieP(ifc)]< 

n-' 

Proof: We consider the adversary $A$ described in the proof of Theorem 3.2: for $t = -5, \ldots, -2$, $A$ prepares a set $T_t$ of $2^{\log_2 n + t}$-survivors, each of size $n/20$, and a set $T_{-1}$ of $2^{\log_2 n - 1}$-survivors; the size of $T_{-1}$ is $(6/20)n$. (We can as before think of this set as being partitioned into six different sets.) We let $\eta$ stand for $6/20$ in the sequel.

Let $p_l$ denote the probability that process 1 holds $l$ as its lottery value after having taken a step in round $k$. For any process $j$ in $T_{-1}$ let also $q_l$ denote the probability that process $j$ holds $l$ as its lottery value at the end of the preparation phase $p$.

The  same  reasoning  as  in  Theorem  3.2  then  leads 
to  the  inequality: 

PA[Wi{k)\ift-i=P,  l€'Pik)]< 


e-32  -I-  (1  -  e-32)(l  -  d/r)"/” 

+  E 

/>log2n4-5 

Write  /  =  log2n  -f  *  —  1  =  log2(n/2)  -t-  x.  Then,  as 
is  seen  in  the  proof  of  Corollary  4.4,  qi  = 
for  some  C  G  (x,  a:  1).  For  /  >  log2n  4-  5,  a:  is  at  least 
6  and  ~  1  so  that  qi  ~  2^"^  >  2^“*.  On  the 

other  hand  p;  =  2~‘  =  2~*‘^'/n. 

Define  ^{x)  =  e~^'  so  that  ^^^(a:)  = 

g-2— ,n/r2i-«^„/r.  Then: 


l>iog2+S 


< 


< 


< 


2/n^2-'(l-^^)’'" 

jr>6  ’’ 

2/n5^2“*e-(^’>"> 

*>6 


t>6 


tjn 


t>6 

fOO 


—Mr 

ijn' 

r 

tjn^ 


To simplify the notations in the sequel, we will let $i_1, i_2, \ldots$ denote the elements of $\mathcal{P}(k)$. And we will let $p_1, p_2, \ldots$ denote the sequence of processes taking steps in turn during round $k$: recall that a process $i$ can take several steps during the round.

The flaw of the protocol revealed in Theorem 3.2 is based on the fact that the variable $R$ does not act as an eraser of the past and that the adversary can use old values to defeat the algorithm. The flaw exhibited in Theorem 3.1 is based on the fact that, even when the old values are erased, the algorithm is sensitive to the order $p_1, p_2, \ldots$ in which participating processes are scheduled. The adversary can play on this order in two different ways. It can act on the fact that different scheduling strategies influence in different ways the size $m$ of the set $\mathcal{P}(k)$ (Strategy 1). And it can use the fact that, for a given number $m$ of participating processes, the mathematical distribution of the sequence $(\beta_i(k);\ i \in \mathcal{P}(k))$ is (a priori) sensitive to the ordering $p_1, p_2, \ldots$ (Strategy 2). The adversary of Theorem 3.1 specifically used Strategy 1.

The next result shows that the two flaws exhibited in Theorems 3.1 and 3.2 are at the core of the problem: the algorithm does have the strong no-lockout property when we precondition on the fact that the internal variables of the participating processes are reset to new values and when we bar the adversary from using Strategy 1. We will actually prove this result for a slightly modified version of the algorithm. Recall in effect that the code given on Page 6 is optimized by making a participating process $i$ draw a new lottery number when it is detected that $V.B < B_i$.

We will consider the "de-optimized" version of the code in which only the test "$V.R \ne R_i$?" causes a new drawing to occur.

The next definition formalizes the restriction that we impose on the adversary. It says that the adversary commits itself to the value of $\mathcal{P}(k)$ at the beginning of round $k$.

Definition 3.2 We say that an adversary is restricted when, for each round, it allocates a step to all participating processes (of this round) before the critical section becomes free. We will let $A'$ (as opposed to $A$) denote any such adversary.

We will make constant use of the notation $[n] := \{1, 2, \ldots, n\}$. Also, for any sequence $(a_j)_{j \in \mathbb{N}}$ we will write $a_i = \mathrm{Umax}_{j \in J}\, a_j$ to mean that $i$ is the only index in $J$ for which $a_i = \mathrm{Max}_{j \in J}\, a_j$.

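Theorem 3.5 For every process $i = 1, \ldots, n$, for every round $k \ge 1$, for every restricted adversary $A'$ and for every $(k-1)$-round run $\rho$ compatible with $A'$,

$$P_{A'}[W_i(k) \mid \mathcal{N}(k),\ \pi_{k-1} = \rho,\ i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] \ge \frac{2}{3m}$$

whenever $P_{A'}[\mathcal{N}(k),\ \pi_{k-1} = \rho,\ i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] \ne 0$.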

Proof:

We first define the events $U(k)$ and $U'_J(k)$, where $J$ is any subset of $\{1, \ldots, n\}$:

$$U(k) := \{\exists!\, i \in \mathcal{P}(k) \text{ s.t. } B_i(k) = B(k)\},$$

$$U'_J(k) := \{\exists!\, i \in J \text{ s.t. } \beta_i(k) = \mathrm{Max}_{j \in J}\, \beta_j(k)\}.$$

The main result established in [6] can formally be restated as:

$$\forall m \le n, \quad P[U'_{[m]}(k)] \ge 2/3. \qquad (1)$$

Following the general proof technique described in the introduction we will prove that:

$$P_{A'}[U(k) \mid \mathcal{N}(k),\ \pi_{k-1} = \rho,\ i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m] = P[U'_{[m]}(k)],$$

and that:

$$P_{A'}[W_i(k) \mid \mathcal{N}(k),\ \pi_{k-1} = \rho,\ i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m,\ U(k)] = P[\beta_i(k) = \mathrm{Max}_{j \in [m]}\, \beta_j(k) \mid U'_{[m]}(k)].$$

The events involved in the LHS of the two equalities (e.g., $W_i(k)$, $U(k)$, $\{|\mathcal{P}(k)| = m\}$, $\{\pi_{k-1} = \rho\}$, $\{i \in \mathcal{P}(k)\}$) depend on $A'$ whereas the events involved in the RHS are pure mathematical events over which $A'$ has no control.

We begin with some important remarks.

(1) By definition, the set $\mathcal{P}(k) = \{i_1, i_2, \ldots\}$ is decided by the restricted adversary $A'$ at the beginning of round $k$: for a given $A'$ and conditioned on $\{\pi_{k-1} = \rho\}$, the set $\mathcal{P}(k)$ is defined deterministically. In particular, for any $i$, $P_{A'}[i \in \mathcal{P}(k) \mid \pi_{k-1} = \rho]$ has value 0 or 1. Similarly, there is one value $m$ for which $P_{A'}[|\mathcal{P}(k)| = m \mid \pi_{k-1} = \rho] = 1$. Hence, for a given adversary $A'$, if the random event $\{\mathcal{N}(k),\ \pi_{k-1} = \rho,\ i \in \mathcal{P}(k),\ |\mathcal{P}(k)| = m\}$ has non zero probability, it is equal to the random event $\{\mathcal{N}(k),\ \pi_{k-1} = \rho\}$, which we denote by $I$.

(2) Recall that, in the modified version of the algorithm that we consider here, a process $i$ draws a new lottery value in round $k$ exactly when $R_i(k-1) \ne R(k)$. Hence, within $I$, the event $\mathcal{N}(k)$ is equal to $\{R_{i_1}(k-1) \ne R(k), \ldots, R_{i_m}(k-1) \ne R(k)\}$. On the other hand, by definition, the random variables (in short r.v.s) $\beta_{i_j}(k)$, $i_j \in \mathcal{P}(k)$, are iid and independent from the r.v. $R(k)$. This proves that (for a given $A'$), conditioned on $\{\pi_{k-1} = \rho\}$, the r.v. $\mathcal{N}(k)$ is independent from all the r.v.s $\beta_{i_j}(k)$. Note that $U'_{\mathcal{P}(k)}(k)$ is defined in terms of (i.e., measurable with respect to) the $(\beta_{i_j}(k);\ i_j \in \mathcal{P}(k))$, so that $U'_{\mathcal{P}(k)}(k)$ and $\mathcal{N}(k)$ are also independent.

(3) More generally, consider any r.v. $X$ defined in terms of the $(\beta_{i_j}(k);\ i_j \in \mathcal{P}(k))$: $X = f(\beta_{i_1}, \ldots, \beta_{i_m})$ for some measurable function $f$. Recall once more that the number $m$ and the indices $i_1, \ldots, i_m$ are determined by $\{\pi_{k-1} = \rho\}$ and $A'$. The r.v.s $\beta_{i_j}$ being iid, for a fixed $A'$, $X$ then depends on $\{\pi_{k-1} = \rho\}$ only through the value $m$ of $|\mathcal{P}(k)|$. Formally, this means that, conditioned on $|\mathcal{P}(k)|$, the r.v.s $X$ and $\{\pi_{k-1} = \rho\}$ are independent: $E_{A'}[X \mid \pi_{k-1} = \rho] = E_{A'}[X \mid |\mathcal{P}(k)| = m] = E[f(\beta_1, \ldots, \beta_m)]$. (More precisely, this equality is valid for the value $m$ for which $P_{A'}[\pi_{k-1} = \rho,\ |\mathcal{P}(k)| = m] \ne 0$.) A special consequence of this fact is that $P_{A'}[U'_{\mathcal{P}(k)}(k) \mid \pi_{k-1} = \rho] = P[U'_{[m]}(k)]$.

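Remark that, in $U(k)$, the event $W_i(k)$ is the same as the event $\{B_i(k) = \mathrm{Umax}_{j \in \mathcal{P}(k)}\, B_j(k)\}$. This justifies the first following equality. The subsequent ones are commented afterwards. Also, the set $I$ that we consider here is the one having a non zero probability described in Remark (1) above.

$$P_{A'}[W_i(k) \mid U(k),\ I]$$

$$= P_{A'}[B_i(k) = \mathrm{Umax}_{j \in \mathcal{P}(k)}\, B_j(k) \mid U(k),\ I]$$

$$= P_{A'}[\beta_i(k) = \mathrm{Umax}_{j \in \mathcal{P}(k)}\, \beta_j(k) \mid U'_{\mathcal{P}(k)}(k),\ I] \qquad (2)$$

$$= P_{A'}[\beta_i(k) = \mathrm{Umax}_{j \in \mathcal{P}(k)}\, \beta_j(k) \mid U'_{\mathcal{P}(k)}(k),\ \pi_{k-1} = \rho] \qquad (3)$$

Equation 2 is true because we condition on $\mathcal{N}(k)$ and because $U(k) \cap \mathcal{N}(k) = U'_{\mathcal{P}(k)}(k)$. Equation 3 is true because $\mathcal{N}(k)$ is independent from the r.v.s $\beta_{i_j}(k)$, as is shown in Remark (2) above.

We then notice that the events $\{\beta_i(k) = \mathrm{Umax}_{j \in \mathcal{P}(k)}\, \beta_j(k)\}$ and $U'_{\mathcal{P}(k)}(k)$ (and hence their intersection) are defined in terms of the r.v.s $\beta_{i_j}(k)$. From Remark (3) above, the value of Eq. 3 depends only on $m$ and is therefore independent of $i$. Hence, for all $i$ and $j$ in $\mathcal{P}(k)$, $P_{A'}[W_i(k) \mid U(k),\ I] = P_{A'}[W_j(k) \mid U(k),\ I]$.

On the other hand,

$$\sum_{i \in \mathcal{P}(k)} P_{A'}[\beta_i(k) = \mathrm{Umax}_{j \in \mathcal{P}(k)}\, \beta_j(k) \mid U'_{\mathcal{P}(k)}(k),\ \pi_{k-1} = \rho] = 1:$$

indeed, one of the $\beta_i$ has to attain the maximum.

These last two facts imply that, $\forall i \in \mathcal{P}(k)$,

$$P_{A'}[W_i(k) \mid U(k),\ I] = 1/m.$$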

We now turn to the evaluation of $P_{A'}[U(k) \mid I]$.

$$P_{A'}[U(k) \mid I] = P_{A'}[U'_{\mathcal{P}(k)}(k) \mid I] \qquad (4)$$

$$= P_{A'}[U'_{\mathcal{P}(k)}(k) \mid \pi_{k-1} = \rho] \qquad (5)$$

$$= P[U'_{[m]}(k)] \ge 2/3. \qquad (6)$$

Equation 4 is true because we condition on $\mathcal{N}(k)$. Eq. 5 is true because $U'_{\mathcal{P}(k)}(k)$ and $\mathcal{N}(k)$ are independent (see Remark (2) above). The equality of Eq. 6 stems from Remark (3) above and the inequality from Eq. 1.

We can now finish the proof of Theorem 3.5.

$$P_{A'}[W_i(k) \mid I] \ge P_{A'}[W_i(k),\ U(k) \mid I] = P_{A'}[W_i(k) \mid U(k),\ I]\; P_{A'}[U(k) \mid I] \ge \frac{2}{3m}.$$

■
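The symmetric situation to which the proof reduces can be explored numerically. The following is a minimal sketch, not part of the original argument: it assumes that all $m$ participants draw fresh, independent lottery values from a geometric law truncated at a cap (here 10, which would correspond to $\log_2 n + 4$ for $n = 64$), and it estimates how often each process holds the unique maximum. The function names, the cap, the seed and the number of trials are illustrative choices.

```python
import random


def draw_lottery(rng, cap):
    """Draw l with P[l] = 2**-l for l < cap, remaining mass on l = cap."""
    value = 1
    while value < cap and rng.random() < 0.5:
        value += 1
    return value


def win_frequencies(m, cap, trials, seed=0):
    """Estimate, for each of m processes, how often it alone holds the maximum."""
    rng = random.Random(seed)
    wins = [0] * m
    for _ in range(trials):
        tickets = [draw_lottery(rng, cap) for _ in range(m)]
        top = max(tickets)
        # A trial counts only when the maximum is attained by exactly one process.
        if tickets.count(top) == 1:
            wins[tickets.index(top)] += 1
    return [w / trials for w in wins]


if __name__ == "__main__":
    m = 5
    for i, freq in enumerate(win_frequencies(m, cap=10, trials=100_000)):
        print(f"process {i}: {freq:.4f}   2/(3m) = {2 / (3 * m):.4f}")
```

Under these assumptions the estimated frequencies should be roughly equal across processes and sit slightly above $2/(3m)$. The sketch says nothing about a general adversary, which can break this symmetry.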

We discuss here the lessons brought by our results. (1) Conditioning on $\mathcal{N}(k)$ is equivalent to forcing the algorithm to refresh all the variables at each round. By doing this, we took care of the undesirable lingering effects of the past, exemplified in Theorems 3.2 and 3.4. (2) It is not true that:

$$P_A[\beta_i(k) = \mathrm{Max}_{j \in \mathcal{P}(k)}\, \beta_j(k) \mid \ldots,\ |\mathcal{P}(k)| = m] = P[\beta_i(k) = \mathrm{Max}_{j \in [m]}\, \beta_j(k) \mid U'_{[m]}(k)],$$

i.e., that the adversary has no control over the event $\{\beta_i(k) = \mathrm{Max}\, \beta_j(k)\}$. (This was Rabin's statement in [6].)

Indeed, the latter probability is equal to $1/m$ whereas we proved in Theorem 3.1 that there is an adversary for which the former is 0 when $m < n - 1$.


The crucial remark explaining this apparent paradox is that, implicit in the expression $P_A[\beta_i(k) = \mathrm{Max}\, \beta_j(k) \mid \ldots]$, is the fact that the random variables $\beta_j(k)$ (for $j \in \mathcal{P}(k)$) are compared to each other in a specific way decided by $A$, before one of them reveals itself to be the maximum. For instance, in the example constructed in the proof of Theorem 3.1, when $j$ takes a step, $\beta_j(k)$ is compared only to the $\beta_l(k)$, $l < j$, and the situation is not symmetric among the processes in $\mathcal{P}(k)$.

But, if the adversary is restricted as in our Definition 3.2, the symmetry is restored and the strong no-lockout property holds.

Rabin and Kushilevitz used these ideas from our analysis to produce their algorithm [7].

In our last Theorem 3.5 we used the restriction on the adversary $A'$ mostly to derive a $1/m$ bound. If we consider a general adversary $A$ it is interesting to note that we can still ensure the weak no-lockout property:

Theorem 3.6 For every process $i = 1, \ldots, n$, for every round $k \ge 1$, for every adversary $A$ and for every $(k-1)$-round run $\rho$ compatible with $A$,

PA[Wi(k)  1  J^(k),  xt_,  =p,ie  7>(i)]  >  .l/n. 

whenever $P_A[\mathcal{N}(k),\ \pi_{k-1} = \rho,\ i \in \mathcal{P}(k)] \ne 0$.

Proof:  Omitted.  ■ 

This theorem holds also if, as in the context of Theorem 3.4, the algorithm uses $b$ lottery numbers. This shows that the result of Theorem 3.6 is not trivial: indeed, when $b = 2\log_2 n$, the probability $P[\beta_i(k) = b]$ of drawing the highest possible number is $o(1/n)$. One of the difficulties of the proof is that the apparently innocuous event $\{i \in \mathcal{P}(k)\}$ is in the future of the point $t(k-1)$ at which the probability is estimated: the adversary could conceivably also use this fact to ensure some specific values of the variables when $i$ participates.

Our Theorems 3.1, 3.2 and 3.4 explored how the adversary can gain and use knowledge of the lottery values held by the processes. The next theorem states that the adversary is similarly able to derive some knowledge about the round numbers, contradicting the claim in [6] that "because the variable $R$ is randomized just before the start of the round, we have with probability 0.99 that $R_i \ne R$." Note that, expressed in our terms, the previous claim translates into $R(k) \ne R_i(k-1)$.

Theorem 3.7 There exists an adversary $A$, a round $k$, a step number $t$, a run $\rho_t$, compatible with $A$, having $t$ steps and in which round $k$ is under way such that

$$P_A[R(k) \ne R_1(k-1) \mid \pi_t = \rho_t] < .99.$$

Proof:

We will write $\rho_t = \rho'\rho$ where $\rho'$ is a $(k-1)$-round run and $\rho$ is the run fragment corresponding to the $k$th round under way. Assume that $\rho'$ indicates that, before round $k$, processes 1, 2, 3, 4 participated only in round $k-1$, and that process 5 never participated before round $k$. Furthermore, assume that during round $k-1$ the following pattern happened: $A$ waited for the critical region to become free, then allocated one step in turn to processes 2, 1, 1, 3, 3, 4, 4; at this point 4 entered the critical region. (All this is indicated in $\rho'$.) Assume also that the partial run $\rho$ into round $k$ indicates that the critical region became free before any competing process was given a step, and that the adversary then allocated one step in turn to processes 5, 3, 3, and that, after 3 took its last step, the critical section was still free. We will establish that, at this point,

$$P_A[R(k) \ne R_1(k-1) \mid \pi_t = \rho'\rho] < .99.$$

By assumption $k-1$ is the last (and only) round before round $k$ where processes 1, 2, 3 and 4 participated. Hence $R_1(k-1) = R_2(k-1) = R_3(k-1) = R(k-1)$. To simplify the notations we will let $R'$ denote this common value. Similarly we will write $\beta'_1, \beta'_2, \ldots$ in place of $\beta_1(k-1), \beta_2(k-1), \ldots$

We will furthermore write $\beta_1, \beta_2, \ldots$ in place of $\beta_1(k), \beta_2(k), \ldots$ and $B$, $R$ in place of $B(k)$, $R(k)$.

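Using Bayes' rule gives us:

$$P_A[R \ne R' \mid \rho', \rho] = \frac{P_A[R \ne R' \mid \rho']\; P_A[\rho \mid \rho', R \ne R']}{P_A[\rho \mid \rho']}. \qquad (7)$$

In the numerator, the first term $P_A[R \ne R' \mid \rho']$ is equal to 0.99 because $R$ is uniformly distributed and independent from $R'$ and $\rho'$. We will use this fact another time while expressing the value of $P_A[\rho \mid \rho']$:

$$P_A[\rho \mid \rho'] = P_A[\rho \mid \rho', R \ne R']\; P_A[R \ne R' \mid \rho'] + P_A[\rho \mid \rho', R = R']\; P_A[R = R' \mid \rho']$$

$$= 0.99\, P_A[\rho \mid \rho', R \ne R'] + 0.01\, P_A[\rho \mid \rho', R = R']. \qquad (8)$$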

• Consider first the case where $R \ne R'$. Then process 3 gets a YES answer when going through the test "$(V.R \ne R_3)$ or $(V.B < B_3)$", and consequently chooses a new value $B_3(k) = \beta_3$. Hence

$$P_A[\rho \mid \rho', R \ne R'] = P[\beta_3 < \beta_5]. \qquad (9)$$

• Consider now the case $R = R'$. By hypothesis, process 5 never participated in the computation before round $k$ and hence draws a new number $B_5(k) = \beta_5$. Hence:

$$P_A[\rho \mid \rho', R = R'] = P_A[B_3(k) < \beta_5 \mid \rho', R = R']. \qquad (10)$$

As processes 1, ..., 4 participated only in round $k-1$ up to round $k$, the knowledge provided by $\rho'$ about process 3 is exactly that, in round $k-1$, process 3 lost to process 2 along with process 1, and that process 2 lost in turn to process 4, i.e., that $\beta'_3 < \beta'_2$, $\beta'_1 < \beta'_2$ and $\beta'_2 < \beta'_4$. For the sake of notational simplicity, for the rest of this paragraph we let $X$ denote a random variable whose law is the law of $\beta'_2$ conditioned on $\{\beta'_2 > \mathrm{Max}\{\beta'_1, \beta'_3\},\ \beta'_2 < \beta'_4\}$. This means for instance that, $\forall x \in \mathbb{R}$,

$$P[X > x] = P[\beta'_2 > x \mid \beta'_2 > \mathrm{Max}\{\beta'_1, \beta'_3\},\ \beta'_2 < \beta'_4].$$

When 3 takes its first step within round $k$, the program variable $V.B$ holds the value $\beta_5$. As a consequence, 3 chooses a new value when and exactly when $B_3(k-1)$ ($= \beta'_3$) is strictly bigger than $\beta_5$. (The case $\beta'_3 = \beta_5$ would lead 3 to take possession of the critical section at its first step in round $k$, in contradiction with the definition of $\rho$; and the case $\beta'_3 < \beta_5$ leads 3 to keep its "old" lottery value $B_3(k-1)$.) From this we deduce that:

$$P_A[B_3(k) < \beta_5 \mid \rho', R = R'] = P[\beta'_3 < \beta_5 \mid \beta'_3 < X] + P[\beta'_3 > \beta_5,\ \beta_3 < \beta_5 \mid \beta'_3 < X]. \qquad (11)$$

Using Lemma 4.5 we derive that:

$$P[\beta'_3 < \beta_5 \mid \beta'_3 < X] \ge P[\beta'_3 < \beta_5].$$

On the other hand $P[\beta'_3 < \beta_5] = P[\beta_3 < \beta_5]$ because all the random variables $\beta_i(j)$, $i = 1, \ldots, n$, $j \ge 1$, are iid. Taking into account the fact that the last term of Equation 11 is non zero, we have then established that:

$$P_A[B_3(k) < \beta_5 \mid \rho', R = R'] > P[\beta_3 < \beta_5]. \qquad (12)$$

Combining Equations 9, 10 and 12 yields:

$$P_A[\rho \mid \rho', R = R'] > P_A[\rho \mid \rho', R \ne R'].$$

Equation 8 then shows that $P_A[\rho \mid \rho'] > P_A[\rho \mid \rho', R \ne R']$. Plugging this result into Equation 7 finishes the proof. ■
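The size of the effect in this proof can be estimated with a short computation. The sketch below is an illustration only and rests on assumptions that are not in the text: lottery values follow the untruncated geometric law $P[\beta = l] = 2^{-l}$, the information in the past run about round $k-1$ is read as "the old values of processes 1 and 3 are strictly below that of process 2, which is strictly below that of process 4", and infinite sums are cut at a hypothetical bound `MAX_L`. It evaluates the two conditional probabilities of the observed fragment and plugs them into Bayes' rule with the prior 0.99.

```python
MAX_L = 60  # cut-off for the infinite sums; the neglected tail is about 2**-60


def g(l):
    """P[beta = l] for the geometric law, l = 1, 2, ..."""
    return 2.0 ** -l


def below(l):
    """P[beta < l]."""
    return 1.0 - 2.0 ** -(l - 1)


def above(l):
    """P[beta > l]."""
    return 2.0 ** -l


def law_of_old_beta3():
    """Law of process 3's old value given old1 < old2, old3 < old2 < old4."""
    weights = {}
    for c in range(1, MAX_L):
        w = 0.0
        for b2 in range(c + 1, MAX_L + 1):
            # old2 = b2, old1 < b2, old4 > b2
            w += g(b2) * below(b2) * above(b2)
        weights[c] = g(c) * w
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def posterior_round_differs():
    """Return (P[rho | R != R'], P[rho | R = R'], P[R != R' | rho])."""
    # R != R': process 3 redraws, and the fragment requires beta_3 < beta_5.
    p_ne = sum(g(c) * above(c) for c in range(1, MAX_L + 1))
    # R = R': either the old value is already below beta_5, or it is above
    # beta_5, process 3 redraws, and the new value falls below beta_5.
    p_eq = 0.0
    for c, pc in law_of_old_beta3().items():
        keep_old = above(c)
        redraw = sum(g(v) * below(v) for v in range(1, c))
        p_eq += pc * (keep_old + redraw)
    post = 0.99 * p_ne / (0.99 * p_ne + 0.01 * p_eq)
    return p_ne, p_eq, post


if __name__ == "__main__":
    p_ne, p_eq, post = posterior_round_differs()
    print(f"P[rho | R != R'] = {p_ne:.4f}")
    print(f"P[rho | R  = R'] = {p_eq:.4f}")
    print(f"P[R != R' | rho] = {post:.4f}")
```

Under these assumptions the first probability equals 1/3, the second is strictly larger, and the resulting posterior falls below 0.99, in line with the statement of the theorem.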

We finish with a result showing that all the problems that we encountered in Rabin's algorithm carry over to Ben-Or's algorithm. Ben-Or's algorithm is cited at the end of [6]. The code of this algorithm is the same as the one of Rabin with the following modifications. All variables $B$, $R$, $B_i$, $R_i$; $1 \le i \le n$ are boolean variables, initially 0. The distribution of the lottery numbers is also different but this is irrelevant for our discussion.

We show that Ben-Or's algorithm does not satisfy the weak no-lockout property of Definition 2.2. The situation is much simpler than in the case of Rabin's algorithm; here all the variables are boolean so that a simple reasoning can be worked out.

Theorem 3.8 (Ben-Or's Alg.) There is an adversary $A$, a step number $t$ and a run $\rho_t$ compatible with $A$ such that

$$P_A[W_2(k) \mid \pi_t = \rho_t,\ 2 \in \mathcal{P}(k)] = 0.$$

Proof: Assume that we are in the middle of round 3, and that the run $\rho_t$ indicates that (at time 0 the critical section was free and then that) the schedule 1 2 2 3 3 was followed, that at this point 3 entered Crit, that it left Crit, that at this point the schedule 4 1 1 5 5 was followed, that 5 entered and then left Crit, that 6 4 4 then took a step and that at this point Crit is still free.

Without loss of generality assume that the round number $R(1)$ is 0. Then $R_2(1) = 0$, $B_1(1) = 1$ and $B_2(1) = 0$: if not, 2 would have entered Crit. In round 2 it then must be the case that $R(2) = 1$. Indeed if this was not the case then 1 would have entered the critical section. It must then be the case that $B_1(2) = 0$ and $B_4(2) = 1$. And then that $B_6(3) = 1$ and $R(3) = 0$: if this was not the case then 4 would have entered Crit in the 3rd round.

But at this point, 2 has no chance to win if scheduled to take a step! ■
4 Appendix

Theorem 4.1 and its corollaries are used in the construction of the adversary in Theorem 3.2 and Theorem 3.4. Lemma 4.5 is used mostly in the proof of Theorem 3.7. The proofs can be found in [8].

Definition 4.1 For any sequence $(a_i)_{i \in \mathbb{N}}$ we denote

$$\mathrm{Max}_s\, a_i = \mathrm{Max}\{a_1, a_2, \ldots, a_s\}.$$

In this section the sequence $(\beta_i)$ is a sequence of iid geometric random variables:

$$P[\beta_i = l] = \frac{1}{2^l}, \quad l = 1, 2, \ldots$$

The following results are about the distribution of the extremal function $\mathrm{Max}_s\, \beta_i$. The same probabilistic results hold for iid random variables $(\beta'_i)$ having the truncated distribution used by Rabin; we just need to truncate at $\log_2 n + 4$ the random variables $\beta_i$ and the values that they take. This does not affect the probabilities because, by definition, $P[\beta'_i(k) = \log_2 n + 4] = \sum_{l \ge \log_2 n + 4} P[\beta_i(k) = l]$.

Theorem 4.1 For $2^{1-x}/s \le 1/2$ we have the following approximation:

$$A := P[\mathrm{Max}_s\, \beta_i \ge \log_2 s + x] \approx 1 - e^{-2^{1-x}}.$$

Corollary 4.2 $P[\mathrm{Max}_s\, \beta_i \ge \log_2 s - 4] \ge 1 - e^{-32}$.

Corollary 4.3 $P[\mathrm{Max}_s\, \beta_i \ge \log_2 s + 8] \le 0.01$.

Corollary 4.4 $P[\mathrm{Max}_s\, \beta_i = \log_2 s] \ge 0.17$, and $P[\mathrm{Max}_s\, \beta_i = \log_2 s + l] \ge 0.01$, $\forall l = 1, \ldots, 5$.

Lemma 4.5 Let $B$ and $A$ be any real-valued random variables. Then

$$\forall x \in \mathbb{R}, \quad P[B > x \mid B < A] \le P[B > x].\ ^{*}$$

* We use the convention that $0/0 = 0$ whenever this quantity arises in the computation of conditional probabilities.
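The numerical constants in the corollaries can be checked against the exact distribution of the maximum. The sketch below is a sanity check under stated assumptions, not a substitute for the proofs in [8]: it uses the untruncated law $P[\beta = l] = 2^{-l}$, so that $P[\beta \ge t] = 2^{1-t}$ and $P[\mathrm{Max}_s\, \beta_i \ge t] = 1 - (1 - 2^{1-t})^s$. The closed form $1 - e^{-2^{1-x}}$ used for comparison is the large-$s$ limit of that expression and is our reading of the approximation in Theorem 4.1. The function names and the sample size `s = 1024` are illustrative.

```python
import math


def prob_max_at_least(s, threshold):
    """Exact P[max of s iid geometric(1/2) draws >= threshold]."""
    if threshold <= 1:
        return 1.0
    return 1.0 - (1.0 - 2.0 ** (1 - threshold)) ** s


def prob_max_equals(s, value):
    """Exact P[max of s iid geometric(1/2) draws == value]."""
    return prob_max_at_least(s, value) - prob_max_at_least(s, value + 1)


def limit_approximation(x):
    """Large-s limit of P[Max_s >= log2(s) + x], assumed form of Theorem 4.1."""
    return 1.0 - math.exp(-(2.0 ** (1 - x)))


if __name__ == "__main__":
    s = 1024
    log_s = int(math.log2(s))
    for x in range(-4, 9):
        exact = prob_max_at_least(s, log_s + x)
        print(f"x = {x:2d}  exact = {exact:.6f}  limit = {limit_approximation(x):.6f}")
    print("P[Max = log2 s] =", round(prob_max_equals(s, log_s), 4))
```

For `s = 1024` the exact values and the limiting expression agree to within about $10^{-3}$, and the thresholds 0.17 and 0.01 of the corollaries are comfortably met.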
Randomized Mutual Exclusion Algorithms Revisited*


Eyal  Kushilevitz^ 


Abstract 

In  [4]  a  randomized  algorithm  for  mutual  ex¬ 
clusion  with  bounded  waiting,  employing  a 
logarithmic  sized  shared  variable,  was  given. 
Saias  and  Lynch  [5]  pointed  out  that  the  ad¬ 
versary  scheduler  postulated  in  the  above  pa¬ 
per  can  observe  the  behavior  of  processes  in 
the  interval  between  an  opening  of  the  criti¬ 
cal  section  and  the  next  closing  of  the  critical 
section.  It  can  then  draw  conclusions  about 
values  of  their  local  variables  as  well  as  the 
value  of  the  randomized  round  number  com¬ 
ponent  of  the  shared  variable,  and  arrange  the 
schedule  so  as  to  discriminate  against  a  chosen 
process.  This  invalidates  the  claimed  proper¬ 
ties  of  the  algorithm. 

In the present paper the algorithm in [4] is modified, using the ideas of [4], so as to overcome this difficulty, obtaining essentially the same results. Thus, as in [4], randomization yields simple algorithms for mutual-exclusion with bounded waiting, employing a shared variable of considerably smaller size than the lower-bound established in [1] for deterministic algorithms.

1  Introduction 

In this paper we deal with the well-known mutual-exclusion problem: Let $P_1, \ldots, P_N$ be $N$ processes that from time to time need to execute a critical section in which exactly one is allowed to employ some shared resource. They can coordinate their activities by use of a shared test-and-set variable $v$ (i.e., testing and setting $v$ itself is an atomic action, and access to $v$ is always available to a $P_i$ scheduled to do so). This problem was suggested by Dijkstra [2] and was discussed in many papers since then (see, for example, [1, 4] and the literature cited there). A solution for this problem is an algorithm that guarantees freedom from deadlock (this alone can be achieved by the use of a one-bit semaphore) and freedom from lockout.

Burns et al. [1] considered the following question: What should be the size of the shared variable $v$ so that (deadlock-free, lockout-free) mutual-exclusion can be implemented? This question is not only of theoretical interest but also of practical interest. This is because in practice test-and-set is not an atomic operation and what we really assume is that reading the variable and immediately writing it can be done very fast so that no other operations will interrupt it. Such an assumption is reasonable only for "short" variables. Burns et al. [1] proved that if deterministic algorithms are used then an $\Omega(\log N)$-bit shared variable is required, and in fact this number of bits is also sufficient.

Rabin [4] presented a randomized solution for the problem using an $O(\log \log N)$-bit shared variable. His algorithm was based on the following lemma:

For any $1 \le m \le N$ processes, say $P_1, \ldots, P_m$, if each $P_i$ randomly, with a geometric distribution, draws a number $1 \le b^{(i)} \le \log N + 4$, then with probability at least 2/3, $b^{(j)} = \max_i b^{(i)}$ will hold for exactly one $j$.
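The lemma can be examined by direct computation. The following sketch is an illustration under assumptions that go beyond the text: the geometric distribution is taken to be $P[b = l] = 2^{-l}$ for $l$ below the cap, with all remaining mass placed on the cap $\log_2 N + 4$, and $N = 64$ is an arbitrary example. It computes, for each $m$, the exact probability that the maximum is attained by exactly one process.

```python
import math


def truncated_geometric(cap):
    """P[b = l] = 2**-l for l < cap, remaining mass on l = cap."""
    pmf = {l: 2.0 ** -l for l in range(1, cap)}
    pmf[cap] = 2.0 ** -(cap - 1)
    return pmf


def prob_unique_max(m, cap):
    """Exact probability that exactly one of m iid draws attains the maximum."""
    pmf = truncated_geometric(cap)
    below = 0.0  # running value of P[b < l]
    total = 0.0
    for l in range(1, cap + 1):
        # One designated process draws l, the other m - 1 draw strictly less;
        # the factor m accounts for the choice of the designated process.
        total += m * pmf[l] * below ** (m - 1)
        below += pmf[l]
    return total


if __name__ == "__main__":
    N = 64
    cap = int(math.log2(N)) + 4
    for m in (1, 2, 3, 8, 32, 64):
        print(f"m = {m:2d}  P[unique max] = {prob_unique_max(m, cap):.6f}")
```

With this particular way of truncating, the value for $m = 2$ comes out within about $10^{-5}$ of 2/3 and the other values are larger. How the tail mass is assigned at the cap is an assumption of the sketch and affects the last digits.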

The algorithm works in rounds, where a round is defined to be the time between two successive entrances to the critical section. To explain the algorithm we provisionally assume that the shared variable $v$ contains a field, updated by the process entering the critical section, representing the current round number $r$. In addition the shared variable contains a flag to indicate whether the critical section is closed or open, and a field $b$ that contains the maximum number drawn during this round. Each process trying to enter the critical section during round $r$, upon accessing the shared variable, draws a number according to the geometric distribution, and updates the field $b$ in the shared variable by the maximum among its number and the current value of $b$. If it already drew a number in round $r$ it does not draw a number again. If the critical section is open and the number it drew equals $b$ (the maximal number drawn in this round), it enters the critical section and starts a new round.
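The round mechanism just described can be sketched in a few lines. This is a simplified model written for illustration, not the algorithm of [4] itself: it keeps the provisional unbounded round number, treats each access to the shared variable as one atomic call, and uses hypothetical names (`Shared`, `Process`, `step`, `leave`) and an arbitrary cap on the lottery values.

```python
import random
from dataclasses import dataclass


@dataclass
class Shared:
    """The shared variable v: round number, open/closed flag, current maximum."""
    round_no: int = 0
    is_open: bool = True
    b: int = 0


@dataclass
class Process:
    """Local state: the last round in which a number was drawn, and that number."""
    last_round: int = -1
    ticket: int = 0


def draw(rng, cap):
    """Geometric draw: value l with probability 2**-l, truncated at cap."""
    value = 1
    while value < cap and rng.random() < 0.5:
        value += 1
    return value


def step(proc, v, rng, cap):
    """One atomic access to v; returns True if proc enters the critical section."""
    if proc.last_round != v.round_no:
        # First access in this round: draw once and update the maximum.
        proc.last_round = v.round_no
        proc.ticket = draw(rng, cap)
        v.b = max(v.b, proc.ticket)
    if v.is_open and proc.ticket == v.b:
        # Holder of the current maximum enters and starts a new round.
        v.is_open = False
        v.round_no += 1
        v.b = 0
        return True
    return False


def leave(v):
    """The process in the critical section leaves; the section becomes open."""
    v.is_open = True
```

A typical use is to let several processes take a step while the section is closed, call `leave`, and then schedule them again: the first scheduled process whose number equals `b` enters. The order of scheduling is exactly the freedom that an adversary can exploit.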

This solution is not only deadlock-free and
lockout-free but also satisfies (using the above
lemma) a powerful fairness property: if a process
$P_i$ participates in a trying round together
with m other processes, it has a probability of
$\Omega(1/m)$ to enter the critical section at the end
of this trying round. This property is called
bounded waiting in [4]. It was also shown, based
on  an  idea  of  Ben-Or,  that  this  algorithm  can 
be  modified  so  as  to  use  just  a  constant  size 
shared variable. This version, however, guarantees
only a weaker fairness property: if a
process $P_i$ participates in a trying round, it
has a probability of $\Omega(1/N)$ to enter the critical
section at the end of this trying round.
This  is  a  much  weaker  property  since  in  prac¬ 
tice  m,  the  number  of  processes  competing  for 
the  critical  section  during  a  trying  round,  is 
typically  much  smaller  than  N,  the  number 
of  processes  in  the  system.  It  is  important  to 
remark  that  in  both  versions  of  the  algorithm 
deadlock  is  never  possible. 

The  difficulty  with  these  algorithms  is  that 
we  assumed  that  the  unbounded  round  num¬ 
ber  r  is  part  of  the  shared  variable  v.  (All  the 
other  fields  of  v  are  of  the  appropriate  size.) 
The  idea  for  dealing  with  this  problem  was 
to  replace  the  use  of  the  round  number  r  by 
a  randomized  round  number  (i.e.,  a  random 
bit  chosen  by  the  process  entering  the  critical 
section).  The  intuition  was  that  a  randomized 
round  number  is  enough  to  guarantee  that  a 
process  will  not  draw  a  number  more  than 
once  at  a  trying  round,  and  it  seemed  that 
the  probability  that  a  process  will  not  draw  a 
number  at  all  is  exactly  1/2.  Therefore,  the 
same  analysis  seemed  to  work. 

Recently,  Saias  and  Lynch  [5]  showed  that 
this is not true. They presented some examples
in which an adversary scheduler can
lock out a process $P_i$. Summarizing these examples,
there are two problems with the use of a
randomized round number instead of the actual
round number in the above algorithms:

• Processes that lost in one trying round
(i.e., did not enter the critical section)
may remain with “high” lottery numbers.
Such an occurrence can be observed by
the  adversary  according  to  the  external 
behavior  of  the  processes  (without  look¬ 
ing  at  the  content  of  their  local  vari¬ 
ables  nor  at  the  shared  variable).  Later, 
these  processes  can  participate  in  a  try¬ 
ing  round  which  has  the  same  random¬ 
ized  round  number,  together  with  the 
process  Pi.  Therefore,  they  will  not  draw 
new  numbers  (and  remain  with  the  old 
“high”  numbers)  and  hence  the  proba¬ 
bility  of  Pi  to  win  the  lottery  in  such  a 
case  is  smaller  than  it  should  be. 

• The adversary can learn whether the current
randomized round number equals
the randomized round number in the last
trying round $P_i$ participated in, by observing
the external behavior of other
processes that participated with $P_i$ in that
previous round. Then, the adversary can
schedule $P_i$ only in rounds with the same
randomized round number. This will cause
$P_i$ not to draw a new number, and this
again decreases the probability of $P_i$ to
win the lottery.

We  modify  the  algorithms  presented  in  [4] 
using  shared  variables  with  the  same  number 
of  bits  (up  to  a  constant)  achieving  the  same 
fairness  properties.  The  key  ideas  for  over¬ 
coming  the  above  two  problems  are: 

•  The  modified  algorithms  make  sure  that 
processes  which  lost  in  a  lottery  will  not 
keep  high  lottery  numbers  that  can  be 
used  in  future  rounds. 

• In the modified algorithms, each trying
round is divided into two parts: the
drawing round, in which the processes
draw their numbers, and the notification
round, in which the processes that took
part in the drawing round find out who
won and who lost. The idea is that during
the drawing round, the external behavior
of processes does not depend on
the outcome of the lottery, and therefore
during this time the adversary cannot
gain information about the contents
of the shared variable nor the local variables
of the processes. During the notification
round, the lottery is already over
and the winner and the losers are already
determined.

We  believe  that  the  separation  idea  may  be 
useful  in  the  design  of  other  randomized  pro¬ 
tocols.  To  summarize:  randomization  yields 
simple  algorithms  for  mutual-exclusion  with 
bounded  waiting,  employing  a  shared  variable 
of  considerably  smaller  size  than  the  lower- 
bound  for  deterministic  algorithms. 

2 The Modified Algorithm

2.1 Definitions and General Plan

Let $P_1, \ldots, P_N$ be the N processes in the
system. The processes coordinate their activities
by use of a shared test-and-set variable
$v = (b_{\mathrm{even}}, r_{\mathrm{even}}, b_{\mathrm{odd}}, r_{\mathrm{odd}}, p, s, w, partic)$,
where each b component is $B = \log_2 N + 4$-valued,
and the other components are 0,1-valued.
Each process $P_i$ has local variables
$b^{(i)}$, $r^{(i)}$, $d^{(i)}$ and some flags. Variables local to
$P_i$ are denoted $x^{(i)}$.

The processes will use a geometric distribution
to pick numbers between 1 and B.
Namely, each $1 \le k \le B - 1$ is drawn with
probability $2^{-k}$ and B is drawn with probability
$2^{-(B-1)}$. We shall denote the act of drawing
such a number by $b^{(i)} := random1$. It was
shown in [4] that for any $1 \le m \le N$ processes,
say $P_1, \ldots, P_m$, if each $P_i$ randomly,
with a geometric distribution, draws a number
$1 \le b^{(i)} \le B$, then with probability at
least 2/3, $b^{(j)} = \max_i b^{(i)}$ will hold for exactly
one $j$.
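
The unique-maximum probability can be evaluated exactly for small parameters. The sketch below is only an illustration, assuming the truncated geometric distribution described above (k with probability 2^-k for k < B, and B with the remaining mass); the function names are hypothetical. It reports the probability rather than proving the bound of [4].

```python
from fractions import Fraction


def ticket_distribution(B):
    """Return {k: Pr[ticket = k]} for the truncated geometric law on 1..B."""
    dist = {k: Fraction(1, 2 ** k) for k in range(1, B)}
    dist[B] = Fraction(1, 2 ** (B - 1))  # remaining probability mass
    return dist


def unique_max_probability(m, B):
    """Exact probability that exactly one of m independent tickets is maximal."""
    dist = ticket_distribution(B)
    total = Fraction(0)
    below = Fraction(0)  # Pr[ticket < k], updated as k grows
    for k in range(1, B + 1):
        # one fixed process draws k, the other m - 1 all draw less than k;
        # multiply by m for the choice of the winning process
        total += m * dist[k] * below ** (m - 1)
        below += dist[k]
    return total
```

For example, `float(unique_max_probability(2, 10))` lies within a negligible truncation term of 2/3, and a single participant wins with probability 1.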

During  the  computation,  each  process  Pi  is 
in  one  of  four  possible  phases:  Trying  phase, 
in which it attempts to enter the critical section,
Critical phase, in which it executes the
critical section, Exit phase, in which it leaves
the  critical  section,  or  Remainder  phase,  in 
which  it  does  local  computations.  In  this  pa¬ 
per,  we  assume  the  same  adversary  scheduler 
that  was  postulated  in  [4].  Namely,  at  any 
given  time  the  adversary  scheduler  can  ob¬ 
serve  the  external  behavior  of  the  processes 
(i.e.,  which  of  the  four  phases  each  process 
currently  executes),  and  use  this  information 
(together  with  its  information  on  the  past  be¬ 
havior  of  the  processes)  to  determine  which 
process  will  be  the  next  to  access  the  shared 
variable.  The  adversary  scheduler  cannot  ob¬ 
serve  the  content  of  the  shared  variable  nor 
the content of any local variable. More formally,
let a run be a (finite or infinite) sequence
$(i_1, x_1), \ldots, (i_k, x_k), \ldots$, where $x_j$ indicates
which phase process $P_{i_j}$ started or
whether it accessed the shared variable. A
scheduler is a (probabilistic) function that on
a finite run $\sigma$ gives the name of the next process
to access the shared variable. A run is
proper  if  it  satisfies  the  obvious  consistency 
conditions. 

The whole computation by $P_1, \ldots, P_N$ leading
to entrance into the critical section is organized
in intervals (the $E_t$, $C_t$ are logical, not
physical, times).

$\cdots\; C_{t-2}\; E_{t-2}\; C_{t-1}\; E_{t-1}\; C_t\; E_t\; \cdots$

where $[C_t, E_t)$ is the t-th critical section, with
$C_t$ (and s := 1) marking the entry, $E_t$ (and
s := 0) marking the exit from this section.

In [4], the processes $P_i$ trying to enter the
next critical section $[C_t, E_t)$ drew, during the
interval $[C_{t-1}, C_t)$, tickets $b^{(i)} := random1$ and
posted the result in v by $b := \max(b, b^{(i)})$.
During $[E_{t-1}, C_t)$ the first process to arrive
with a highest ticket $b^{(i)} = b$ enters the critical
section. The crucial modification in the
present algorithm is the following. In the t-th
drawing round $[C_{t-2}, C_{t-1})$, participating processes
draw numbers to determine the winning
process $P_i$ enabled to enter the t-th critical
section at time $C_t$. By time $C_{t-1}$, all but
at most B of the participants in the drawing
round $[C_{t-2}, C_{t-1})$ will know that they lost in
this drawing. In the t-th notification round
$[C_{t-1}, C_t)$ each of the remaining participants
finds out whether it won or lost, thus by time
$C_t$ there will be a $P_i$ knowing that it alone won
entrance to the critical section. The whole
computation is arranged so that, for every t,
the (t+1)-st drawing round overlaps with the
t-th notification round, as indicated in Figure 1.

In other words, the interval $[C_{t-1}, C_t)$ serves
both as the t-th notification round and as the
(t+1)-th drawing round. To disambiguate
this dual function of an interval $[C_{t-1}, C_t)$, the
intervals are classified as even and odd. The
parity component p of v will satisfy in the
above interval $p = (t - 1) \bmod 2$. Thus p = 0
signifies an even drawing round for entrance
in the following even critical section at $C_{t+1}$,
as well as an odd notification round to notify
the winner in $[C_{t-2}, C_{t-1})$, the just previous
odd drawing round.
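
The dual role of each interval can be tabulated with a few lines of Python. This is only a sketch of the indexing convention stated above; the function name and the dictionary keys are hypothetical.

```python
def interval_roles(t):
    """Roles of the interval [C_{t-1}, C_t), following p = (t - 1) mod 2."""
    p = (t - 1) % 2
    return {
        "p": p,
        # the drawing in this interval decides who enters critical section t + 1
        "drawing_for_section": t + 1,
        "drawing_parity": "even" if p == 0 else "odd",
        # the notification in this interval informs the winner of section t
        "notification_for_section": t,
        "notification_parity": "odd" if p == 0 else "even",
    }
```

For t = 3 the interval has p = 0: it hosts an even drawing (for section 4) and an odd notification (for section 3).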
Round t:        Drawing        Notification
Round t + 1:                   Drawing        Notification

Figure 1: Organization of the computation in rounds


The key point is that since the drawing determining
the winner $P_i$ enabled to enter at
$C_t$ ends at time $C_{t-1}$, before the opening of
the critical section at $E_{t-1}$, there will be no
possibility for the scheduler to infer significant
information about processes’ local variables
or about the r component of v while the
drawing round $[C_{t-2}, C_{t-1})$ is in progress. It
will also be seen that all participants losing in
a drawing leave that drawing with their local
lottery ticket $b^{(i)} = 0$. Thus the scheduler cannot
hoard  processes  with  large  ticket  values  and 
use  them  in  later  drawing  rounds,  as  it  could 
in  the  original  version  [4]  of  the  algorithm. 

2.2 Detailed Description of the Algorithm

When $P_i$ is scheduled to test and set v, it
knows by looking at local and global flags
whether it should execute in some drawing
round $[C_{t-1}, C_t)$ and, if yes, whether $t - 1$ is
even or odd. Assume $P_i$ is in an even drawing
round. It executes the protocol of Figure 2.

The test (1) in that protocol ensures that $P_i$
will not draw more than once in $[C_{t-1}, C_t)$. If
in the drawing it did not exceed the current
value $b_{\mathrm{even}}$, then it knows it lost and leaves
with $b^{(i)} = d^{(i)} = 0$. Later it will not execute
Notification, only Drawing.

Assume that $P_1, \ldots, P_k$ participated in
$[C_{t-1}, C_t)$ and executed (2)-(3) (i.e., each of
them, at the time it drew its number, was a
local maximum). Then by $C_t$ we have $b_{\mathrm{even}} =
d^{(1)} + \cdots + d^{(k)}$ (intuitively, $d^{(i)}$ is the contribution
of $P_i$ to $b_{\mathrm{even}}$) and the unique winner is
actually already determined. In particular, it
follows that the number of processes that executed
(2)-(3) is small (in fact, it cannot be larger
than B). All the other processes that participated
in the drawing round already know
they lost. This is very important because in
the claimed size of v we cannot count, for example,
the number of processes which participated
in the lottery. The way we make sure
that all the processes $P_1, \ldots, P_k$ will take part
in the notification round is by letting each
$P_i$ subtract its contribution $d^{(i)}$ from $b_{\mathrm{even}}$.
When all of them are notified we will have
$b_{\mathrm{even}} = 0$.

Notifying the winner and the losers
amongst $P_1, \ldots, P_k$ is done in the even notification
round $[C_t, C_{t+1})$. At time $C_t$ we
have  5  =  0,p  =  1  (an  odd  drawing  round) 
and w = 0 (winner not yet notified). The
structure of the Notification round is as follows:
the processes that lost wait until
the winner is notified. Then the losing
processes are all notified, and only then is the
winner enabled to enter the critical section
(if it is open). By looking at flags, every $P_i$
knows whether for it $[C_t, C_{t+1})$ is a Notification
round (or a Drawing round). It executes
the protocol of Figure 3.

After $P_i$ is enabled it must wait until the
critical section is open. It executes the
protocol of Figure 4. The salient point in
that protocol is that the round count number,
$r_{\mathrm{even}}$, is set to a random value. It is on this
feature that the following proof of Lemma 1
rests.

Note  that  partic  =  1  at  Ct+i  if  and  only  if 
Even Drawing for P_i (p = 0):

(1) If [r^(i) ≠ r_even ∨ b_even = 0] ∧ [d^(i) = 0 ∧ w^(i) = 0] then
        (* P_i has not drawn in [C_{t-1}, C_t) and is not
           participating in the current odd notification *)
    begin
        r^(i) := r_even
        b^(i) := random1
        If b_even < b^(i) then          (* P_i is a local maximum *)
(2)         d^(i) := b^(i) - b_even
(3)         b_even := b^(i)
        else                            (* P_i knows it lost *)
            b^(i) := 0
    end

Figure 2: Even Drawing Protocol
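
The drawing step can be exercised in a few lines of Python. This is a sketch under stated assumptions, not the authors' code: the shared and local variables are plain dictionaries with hypothetical keys, and step (2) is taken to record the increment b^(i) - b_even, which is what the identity b_even = d^(1) + ... + d^(k) quoted in the text suggests.

```python
import random


def random1(B, rng):
    """Truncated geometric draw on 1..B (k with probability 2^-k for k < B)."""
    for k in range(1, B):
        if rng.random() < 0.5:
            return k
    return B


def even_drawing(shared, local, B, rng):
    """One test-and-set access by a process during an even drawing round."""
    not_drawn = local["r"] != shared["r_even"] or shared["b_even"] == 0
    idle = local["d"] == 0 and local["w"] == 0
    if not_drawn and idle:                              # test (1)
        local["r"] = shared["r_even"]
        local["b"] = random1(B, rng)
        if shared["b_even"] < local["b"]:               # local maximum
            local["d"] = local["b"] - shared["b_even"]  # step (2), assumed form
            shared["b_even"] = local["b"]               # step (3)
        else:                                           # the process knows it lost
            local["b"] = 0
```

Running the step for many processes in any order keeps `shared["b_even"]` equal to the sum of the recorded contributions, and at most B processes can hold a nonzero contribution, since each contributor strictly raises `b_even`.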


Even Notification for P_i (p = 1):

If w = 0 ∧ b^(i) = b_even then          (* P_i is the winner *)
begin
    w^(i) := 1                          (* P_i knows it won *)
    w := 1                              (* winner was notified *)
    b_even := b_even - d^(i)
    d^(i) := 0
end
If w = 1 ∧ w^(i) = 0 then               (* P_i knows it lost *)
begin
    b_even := b_even - d^(i)
    b^(i) := 0                          (* clean up local variables *)
    d^(i) := 0
    r^(i) := nil
    P_i executes Odd Drawing
end
If w^(i) = 1 ∧ b_even = 0 then          (* all losers were notified *)
    P_i is enabled to enter the critical section

Figure 3: Even Notification Protocol
Enabled P_i Enters Even Critical Section:

If s = 0 then                           (* critical section is open *)
begin
    w := w^(i) := 0
    b^(i) := 0                          (* clean up local variables *)
    d^(i) := 0
    r^(i) := nil
    r_even := random(0, 1)              (* Assign 0 or 1 with equal probabilities *)
    p := 0                              (* start of even Drawing *)
    s := 1                              (* critical section closed *)
    If b_odd > 0 then                   (* some process drew in [C_t, C_{t+1}) *)
        partic := 1                     (* there was participation *)
    else
        partic := 0                     (* no participation *)
end

Figure 4: Entrance to Even Critical Section


some $P_j$ drew in $[C_t, C_{t+1})$. The above protocols
should be augmented by the provision
that a trying process $P_i$ accessing v, upon
finding partic = 0, sets partic := 1, $w := w^{(i)} := 1$
and is enabled to enter the next
critical section. In the beginning, v is initialized
with partic = 0.

2.3  Correctness 

It goes without saying that the adversary
scheduler can always discriminate against a
chosen $P_i$ by consistently scheduling it very
rarely or scheduling it together with many
other processes. However, given that $P_i$ participates
in a drawing round with $m - 1$ other
processes, the following lemma gives a lower
bound on the probability of $P_i$ entering the
critical section which depends only on m, the
actual number of participants in the drawing
round (and not on N). It is important to
note that this probability is also independent
of the past.


Before formulating the lemma, we need to
argue that the probability space that we are
dealing with is well defined. We say that
a (proper) run $\sigma = (i_1, x_1), \ldots, (i_k, x_k)$ is of
length t if exactly t of the $x_j$'s indicate a
start of a Critical phase, and one of them is
$x_k$. Let $V_{t-1}$ be the view of the system at time
$C_{t-1}$, just before $r_{\mathrm{even}}$ was randomly set to 0
or 1 (we assume that $t - 1$ is even). That
is, $V_{t-1}$ consists of the values of the shared
variable and all the local variables in the system
at this time. Given $\sigma$, a (proper) run
of length t, and a view $V_{t-1}$ consistent with
it (e.g., the process that enters the critical
section at time $C_t$ should be the winner of
the previous drawing round $[C_{t-2}, C_{t-1})$ as reflected
by $V_{t-1}$), the winner of the drawing
round $[C_{t-1}, C_t)$ depends only on the random
choices of the processes which are in a Trying
phase during this time, and the random value
of $r_{\mathrm{even}}$. It is important to note that the process
that will enter the critical section at time
$C_t$ is already determined by $V_{t-1}$. The random
choices made during the time $[C_{t-1}, C_t)$
affect the values of the shared variable and
the local variables, but will affect the external
behavior of the processes only at time $C_{t+1}$
(until that time they all stay in the Trying
phase). Therefore, the run $\sigma$ is independent
of the random choices made during this time,
and hence we get a well defined probability
space. (In a sense, we give here an additional
power to the adversary by allowing him a total
view of the system just before the drawing
round starts.) Now, we can formalize the
lemma:

Lemma 1: Let $\sigma$ be any (proper) run of
length t, and $V_{t-1}$ be any view consistent with
$\sigma$. If in $\sigma$ process $P_i$ participates in the drawing
round $[C_{t-1}, C_t)$ together with $m - 1$ other
processes and if $V_{t-1}$ is the view of the system
at time $C_{t-1}$, then with probability at
least $1/(3m)$ the process $P_i$ will enter the critical
section at time $C_{t+1}$.

Proof: Assume $t - 1$ is even. At time
$C_{t-1}$ the value of the component $r_{\mathrm{even}}$ was
randomly set to 0 or 1.

Clearly, all processes which do not participate
in the drawing round $[C_{t-1}, C_t)$ have no
influence on the drawing (they either access
only the fields of the shared variable connected
with the odd drawing rounds or do
not access the shared variable at all). It is
also guaranteed by the algorithm that every
process $P_j$ that first joins the drawing has
$b^{(j)} = 0$.

As claimed above, even though we allow the
scheduler full information on the past (by giving
him $V_{t-1}$), and in particular we allow him
to look at $r^{(j)}$ of all $P_j$'s, the run $\sigma$ is independent
of the value of $r_{\mathrm{even}}$, and in particular
is independent of whether $r^{(i)} \ne r_{\mathrm{even}}$ at the
time that $P_i$ first accesses v during the drawing
round. Hence, as $r_{\mathrm{even}}$ is chosen to be 0 or 1
uniformly, if $r^{(i)} \ne nil$ then with probability
1/2 we have $r^{(i)} \ne r_{\mathrm{even}}$ when $P_i$ first accesses
v in $[C_{t-1}, C_t)$. In such a case $P_i$ draws a new
$1 \le b^{(i)} \le B$ in this round. (If $r^{(i)} = nil$ then
$P_i$ draws a new number in this round with
probability 1.)

Every other process $P_j$ of the m participating
processes draws its random $b^{(j)}$ at
most once during the current drawing round.
Those $P_j$'s which do not actually draw have
$b^{(j)} = 0$ and do not affect $b_{\mathrm{even}}$ at all. By
Rabin's lemma (quoted in the Introduction),
the probability of having a unique winner is
at least 2/3 (no matter what the number
of other processes that draw new numbers is).
Given that there is a unique winner then, by a
symmetry argument, each process that draws
a new number has the same probability to be
the winner. (Note that there is no symmetry
in case the winner is not unique; the
process that draws the maximal number first
is the winner.) Altogether, with probability
at least $1/2 \times 2/3 \times 1/m = 1/(3m)$, process $P_i$
will draw the maximum value in this drawing
round and will be alone in this. Hence
$b^{(i)} = b_{\mathrm{even}}$ at time $C_t$ and $P_i$ must be the one
to enter the next critical section at $C_{t+1}$. □
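
The three factors in the final bound can be laid out as a chain of conditional probabilities. The event names D ($P_i$ draws a new number in the round) and U (the maximum is unique) are introduced here only for illustration and do not appear in the original proof.

```latex
\Pr[P_i \text{ wins}]
  \;\ge\; \Pr[D] \cdot \Pr[U \mid D] \cdot \Pr[P_i \text{ holds the maximum} \mid U, D]
  \;\ge\; \frac{1}{2} \cdot \frac{2}{3} \cdot \frac{1}{m}
  \;=\; \frac{1}{3m}.
```

The first factor comes from the random bit $r_{\mathrm{even}}$, the second from Rabin's lemma, and the third from symmetry among the at most m processes that draw.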

The property that for any $m \le N$ if $P_i$
is participating in a drawing round together
with any $m - 1$ other processes, then with
probability at least $1/(\gamma m)$ it will win entrance to the
critical section, is called $\gamma$-bounded waiting
in [4]. Thus we have

Theorem 2: Mutual exclusion with
bounded waiting can be achieved for an N-processor
system by use of an $O((\log_2 N)^2)$-valued
shared test-and-set variable.

This is Theorem 4 of [4] except that now we
require a $(\log_2 N)^2$-valued variable, instead of
$\log_2 N$ values. (In terms of the number of
bits, both the old and new theorems use an
$O(\log\log N)$-bit variable.)
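
To get a feel for the sizes involved, the following sketch counts the values and bits of the shared variable. It assumes two B-valued fields (with B = log2 N + 4, as stated in Section 2.1) and six binary fields; the count of six binary fields is an inference from the protocols and should be read as an assumption, as should the function name.

```python
def shared_variable_size(log2_n, binary_fields=6):
    """Return (number of values, number of bits) of the shared variable v.

    log2_n is log2 of the number of processes N (N assumed a power of two).
    binary_fields is an assumed count of the 0/1-valued components of v.
    """
    B = log2_n + 4
    values = B * B * 2 ** binary_fields   # two B-valued fields, rest binary
    bits = (values - 1).bit_length()      # bits needed to encode that many values
    return values, bits
```

For N = 2^20 processes this gives 36864 values, i.e. 16 bits, whereas merely naming one process out of N already takes 20 bits; the gap widens as N grows, since the bit count grows like log log N.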

3  Concluding  Remarks 

Another version of the mutual exclusion algorithm
given in [4] is based on a random
drawing suggested by Ben-Or. Specifically,
$P_i$ draws the value 2 with probability $1/N$
and the value 1 with probability $1 - 1/N$.
It is readily seen that if $P_i$ participates in a
drawing together with $m - 1$, $1 \le m \le N$,
other processes, then with probability at least
$1/(2eN)$ the process $P_i$ will be the sole winner.
Using this lottery we get

Theorem 3: For an appropriate constant c
($c \le 3^2 \cdot 2^6$), starvation-free mutual exclusion
for N processes can be achieved by use of a
c-valued test-and-set variable.
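
The bound $1/(2eN)$ for the Ben-Or lottery can be checked numerically. The sketch below assumes that the factor 1/2 is the probability that $P_i$ actually draws a new number (as in the proof of Lemma 1) and that the other $m - 1$ participants each independently end up with at most 1 with probability at least $1 - 1/N$; the function name is hypothetical.

```python
import math
from fractions import Fraction


def sole_winner_lower_bound(n, m):
    """Lower bound on Pr[P_i is the sole winner] among m participants out of n."""
    redraw = Fraction(1, 2)                             # P_i draws a new number
    draws_two = Fraction(1, n)                          # and that number is 2
    others_at_most_one = (1 - Fraction(1, n)) ** (m - 1)
    return redraw * draws_two * others_at_most_one


def meets_claimed_bound(n):
    """Check the bound 1/(2 e n) for every 1 <= m <= n."""
    target = 1 / (2 * math.e * n)
    return all(float(sole_winner_lower_bound(n, m)) >= target
               for m in range(1, n + 1))
```

The check succeeds because $(1 - 1/N)^{N-1} > 1/e$ for every $N \ge 2$, so the worst case m = N still clears the bound.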

This  is  the  same  result  as  in  [4].  As  in  [4], 
Theorems  2  and  3  should  be  contrasted  with 
the lower bound results in [1]. It is shown
there that for deterministic algorithms an N-valued
variable is necessary for mutual exclusion
with bounded waiting (in the sense of
[1]), and an N/2-valued variable is necessary
for starvation-free mutual exclusion.

In fact, the last algorithm can be further
simplified: in the Notification round all that
is needed is for the unique winner to find out
that it won. The numbers held by the losing
processes are at most 1 and therefore if
$P_i$ participates in a drawing round then still
with probability at least $1/(2eN)$ the process $P_i$
will be the sole winner. Also, in the first algorithm
b can be duplicated (thus increasing
the number of bits in v by a constant factor)
so that the losing processes will not have to
wait until the winner is notified. This may
be helpful in scenarios where some of the processes
are much “slower” than the others.


An  interesting  open  problem  is  to  try  to 
show  a  better  upper  bound  than  the  one  pre¬ 
sented  in  Theorem  2,  or  to  prove  a  lower 
bound.  A  step  in  this  direction  was  achieved 
in  [3]  where  it  is  proven  that  there  is  no  bet¬ 
ter  lottery  with  the  unique  winner  property. 
Therefore,  if  one  tries  to  improve  the  upper 
bound  he  cannot  hope  to  do  that  by  using 
only  a  different  lottery,  but  instead  he  should 
find  a  somewhat  different  algorithm. 
Errata: Knowledge in Shared Memory Systems

(PODC  1991) 

Michael  Merritt  Gadi  Taubenfeld 


The errata: In subsection 3.2, after the definition of knowledge,
appears the following claim: “Notice that when $\beta$ is a
stable predicate then also $K_Q\beta$ is a stable predicate.” This
claim is incorrect, and should be replaced with: “Notice that
when $\beta$ is a stable predicate then $K_Q\beta$ is not necessarily
a stable predicate.”

The proofs of Theorem 1 and Theorem 2 relied on this
incorrect claim, and are correct only for $\beta$ for which $K_Q\beta$ is
stable. Following are new proofs of Theorem 1 and Theorem 2
that hold in the general case, in which $K_Q\beta$ may not
be stable. The authors regret the error, and thank Yoram
Moses and Lenore Zuck for pointing it out.

Recall that it is assumed in the two theorems that $\beta_1$ and
$\beta_2$ are disjoint stable predicates.

Theorem  1  In  any  asynchronous  read-write  protocol,  for 
any  run  x,  if  A{pi,  Pi}  at  x  then  for  some  set  of  processes 
Q  where  |5|  =  n  —  1,  ->Og{Kgpi  V  Kgpf)  at  x. 

In  order  to  prove  the  theorem  we  first  prove  two  lemmas. 
Without  loss  of  generality,  we  consider  in  this  proof  only 
deterministic  processes.  That  is,  if  {x;ep)  and  {x\e'p)  are 
runs  then  Cp  =  e'p.  When  p  is  enabled  at  x,  we  denote  by 
OpX  the  unique  extension  of  x  by  a  single  event  of  p.  Finally, 
by  0  we  denote  the  set  of  all  processes  not  in 

Lemma 1 Let x and y be finite runs and Q be a set of processes.
If $\Diamond_Q K_Q \beta_1$ at x, $x[Q]y$, and value(r, x) = value(r, y)
for every shared register r, then $\neg\Diamond\beta_2$ at y.

Proof:  Assume  OgKgPx  at  x,  x[C/]y,  and  value(r,x)  = 
value(r,y)  for  every  shared  register  r.  From  RW1-RW2, 
X  is  a  prefix  of  a  p-fair  run  z  where  x[0]z.  Since  OgKgPi 
at  X  there  is  a  finite  run  x  <  x'  <  z  such  that  x[5]x'  and 
KgPi  at  x'.  From  RW\-RWi,  w  =  {y,  (x'  -  x))  is  also  a 
run.  Since  KgPi  at  x'  and  x'{S]w,  it  has  to  be  that  Pi  at 
w.  Since  Pi  is  stable  and  disjoint  from  P2,  it  is  the  case  that 
-<OP2  at  y.  I 

We  use  the  notation  (OK)^,  at  x  as  an  abbreviation  for 
OgKgP  at  X  for  every  set  of  processes  Q  where  \G\  >  n'. 
Notice  that  it  follows  from  Lemma  1  that  for  any  finite  runs 
X  and  y  and  set  of  processes  G  where  |5|  >  n',  if 
at  X,  x[G\y,  and  value{r,x)  =  value{r,y)  for  every  shared 
register  r,  then  -<OPi  at  y. 

Lemma  2  For  any  finite  run  x  and  any  process  p,  if 
^g(KgPi  V  KgPi)  at  X  for  every  set  of  processes  G  where 
\G\  >  n—  1,  A{Pi,pi}  at  x,  andp  is  enabled  at  x  then  there 
exists  y  >  X  such  that  -‘x]p]y  and  A{Pi,Pi}  at  y. 

Proof:  Assume  to  the  contrary  that  for  some  run  x  and 
process  p,  Og  (Kg Pi  ^  KgPi)  at  x  for  every  set  of  processes 


G  where  \G\>  n  —  1,  A{Pi,Pi}  at  x,  p  is  enabled  at  x  and 
there  does  not  exist  v  >  x  such  that  ->x^y  and  A{Pi,Pi} 
at  y.  Observe  that  it  follows  from  the  assumption  that  for 
any  extension  m  of  x  where  -<(/3i  V  Pi)  at  m  and  p  is  enabled 
at  X,  either  (OK)^j/3i  at  Optn  or  (OK)^j/3a  at  o^m.  Assume 
w.Lo.g.  that  {OK)^iPi  at  OpX. 

Since  A{Pi,Pi}  at  x  there  exists  a  finite  extension  z  of 
X  (z  ^  x)  such  that  OP2  at  z  and  for  any  x  <  j/  <  z  it  is 
the  case  that  A{Pi,p2}  at  y.  It  is  important  to  notice  that 
(0A)„_,/?2  at  z.  Rirthermore,  either  at  z  or  (OA),_i/?2  at 
OpZ  (if  OpZ  exists). 

Let  z'  be  the  longest  prefix  of  z  such  that  x[p]z'.  We 
notice  that  either  Pi  at  z'  (when  z'  =  z)  or  (OA!)^,j3j  at 
Opz',  and  in  either  case,  Opi  at  OpZ'  . 

Consider  the  extensions  of  x  which  are  also  prefixes  of  z'. 
Since  (OK)^,;3i  at  Opi  by  the  observations  made  so  far,  there 
must  exist  extensions  y  and  y'  (of  x)  where  j/  is  a  one  event 
extension  of  y',  such  that  lf>K)„_iPi  at  either  Pi  ut  y 
(when  y  =  z)  or  lf>K)^^Pi  at  Opp,  and  in  either  case,  OPi 
at  Opp.  Let  y  =  (y'\ Cp<)  for  some  event  ep<  where  p'  /  p. 

First  we  show  that  ep>  is  a  write  event.  Assume  that  Cp' 
is  not  a  write  event.  By  RW\-RWZ,  (opy'  —  y')  =  (opy  —  y) 
and  hence  Opj/'[N  -  {p'}]opl/.  Also,  the  values  of  all  shared 
registers  are  the  same  in  Opy  and  Opp',  and  as  already  men¬ 
tioned  (OK)^jPi  at  Opy'.  By  Lemma  1,  -lOPi  at  Opp',  a 
contradiction.  Therefore,  Cp'  must  be  a  write  event. 

We  notice  that,  by  RWl  w  =  (opj/';ep')  is  a  run.  Also, 
since  if>K)„_^Pi  at  Opp',  it  must  be  the  case  that  for  G  = 
N  —  {p'},  OgKgPi  at  w. 

Next  we  show  that  (opp'  —  p')  is  a  write  event.  Assume 
(opp'-p')  is  not  a  write  event.  Then,  u;[N-{p})p,  and  w[N- 
{p}]opp.  Also,  the  values  of  all  shared  registers  are  the  same 
in  w,  y  and  Opp.  Because  w  >  Opp',  oPi  at  w.  Now,  recall 
that  either  P2  at  p  or  at  Opp.  It  follows  that,  either 

(O/O,_i/02  at  p  or  iOK)^^P2  at  Opp.  By  applying  Lemma  1, 
in  both  cases  -<OPi  at  to,  a  contradiction.  Therefore,  for 
two  registers  ri  and  rj,  and  values  vi  and  vj,  (opp'  -  = 

writep(ri,vi),  and  (p  -  p')  =  writep>(r2,V2). 

Assume  ri  ^  rj.  Since  the  two  write  events  are  indepen¬ 
dent,  the  values  of  all  shared  registers  are  the  same  in  w  and 
Opp.  Also,  to(Af]opp  and  hence  to[^]opp  for  G  =  ^  —  {p^}-  As 
pointed  out,  OgKgPi  at  to,  and  hence  by  Lemma  1,  ->0^ 
at  Opp,  a  contradiction. 

Assumeri  =  n.  Clearly,  volue(ri,Opp')  =  value(ri,Opp). 
Hence,  the  values  of  all  shared  registers  are  the  same  in  Opp' 
and  Opy.  Also,  Opp'jN  —  {p'}]opp  and  as  assumed  (OK)^,Pi 
at  Opp'.  As  before  by  Lemma  1,  -<OP2  at  Opp,  a  contradic¬ 
tion.  I 

Proof  of  Theorem  1:  Assume  to  the  contrary  that  for  some 
run  X  where  A{/?i,j9a}  at  x,  it  is  the  case  that  Og(KgPi  V 
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KgPi)  at  X  for  every  set  of  processes  Q  where  |5|  =  n  —  1. 
Using  Lemma  2  we  can  construct  inductively  starting  from 
the  run  x  an  n-fair  run  such  that  A{/3i,/32}  holds  at  all  the 
finite  prefixes  of  that  run,  and  ->Os{KnPi  V  KnPi)  at  x. 
Thus,  -'Oo{Kg0i  V  KgPi)  at  i  for  some  set  of  processes  Q 
where  15|  =  n  -  1,  a  contradiction.  I 

Theorem  2  In  any  asynchronous  read-modify-write  proto¬ 
col,  for  any  run  x  where  A{/3i,/32}  at  x,  if  OQ{Kgl3\'^ Kg^n) 
at  X  for  every  set  of  processes  Q  where  \Q\'>n  —  2,  then  the 
protocol  uses  at  least  one  non-binary  shared  register. 

In order to prove the theorem we first prove two lemmas.
In proving these lemmas we assume that all shared registers
are binary registers. As in Theorem 1, without loss of generality,
we consider in this proof only deterministic processes,
and  denote  by  OpX  the  unique  extension  of  x  by  a  single 
event of p. We point out that Lemma 1, and the observation
following it, although proved for read-write protocols,
hold also for read-modify-write protocols with essentially the
same proof.

Lemma 3 Let $x$ be a run, $p$ and $p'$ be two processes which
are enabled at $x$, and $Q = N - \{p,p'\}$. If $\Diamond_Q K_Q\beta_1$ at $o_p x$
then $\neg\Diamond\beta_2$ at $o_{p'} x$.

Proof: Assume $\Diamond_Q K_Q\beta_1$ at $o_p x$ and let $(o_p x - x) =
rmw_p(r_1, v_1', v_1)$, and $(o_{p'} x - x) = rmw_{p'}(r_2, v_2', v_2)$, for
registers $r_1$ and $r_2$, and values $v_1', v_1, v_2', v_2 \in \{0,1\}$. Let
$y = o_{p'} o_p x$ and $y' = o_p o_{p'} x$. Note that, since $y \ge o_p x$ and
$y[Q]o_p x$, $\Diamond_Q K_Q\beta_1$ at $y$.

Assume $r_1 \neq r_2$. Since the two events are independent,
the values of all shared registers are the same in $y$ and $y'$,
and $y[N]y'$. By the note above, also $\Diamond_Q K_Q\beta_1$ at $y$. Thus,
by Lemma 1, $\neg\Diamond\beta_2$ at $y'$, and hence $\neg\Diamond\beta_2$ at $o_{p'} x$.

Assume $r_1 = r_2$. Then $v_1' = v_2'$. There are three possible
cases. (1) $v_1 = v_2$. Since $\Diamond_Q K_Q\beta_1$ at $o_p x$, $o_p x[Q]o_{p'} x$, and
$value(r, o_p x) = value(r, o_{p'} x)$ for every shared register $r$, by
Lemma 1, $\neg\Diamond\beta_2$ at $o_{p'} x$. (2) $v_1' = v_1$. By RMW1, $y =
(o_p x; rmw_{p'}(r_2, v_2', v_2))$ is a run. Again by the note above,
$\Diamond_Q K_Q\beta_1$ at $y$, $y[Q]o_{p'} x$, and $value(r, y) = value(r, o_{p'} x)$ for
any shared register $r$, so by Lemma 1, $\neg\Diamond\beta_2$ at $o_{p'} x$. (3) $v_2' =
v_2$. The values of all shared registers are the same in $o_p x$ and
$y'$, $o_p x[Q]y'$, and also $\Diamond_Q K_Q\beta_1$ at $o_p x$. Thus, by Lemma 1,
$\neg\Diamond\beta_2$ at $y'$, and hence $\neg\Diamond\beta_2$ at $o_{p'} x$. ∎
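
The same-register case split is exhaustive only because the register is binary. A small enumeration makes this visible. This is an illustrative sketch under stated assumptions: a single register is modeled by its value, both events read the same value `a`, and the function names are hypothetical rather than taken from the paper.

```python
from itertools import product


def rmw(value, old, new):
    """Register value after an rmw event that reads `old` and writes `new`,
    or None if the register does not hold `old` (event not enabled)."""
    return new if value == old else None


def applicable_cases(a, v1, v2):
    """Cases of the same-register argument that apply when p and p' both
    read `a`, p writes v1 and p' writes v2. Each case also checks the
    register-value equality that the proof relies on."""
    after_p = rmw(a, a, v1)          # register value in o_p x
    after_p_prime = rmw(a, a, v2)    # register value in o_{p'} x
    cases = []
    if v1 == v2 and after_p == after_p_prime:
        cases.append(1)              # both events write the same value
    if v1 == a and rmw(after_p, a, v2) == after_p_prime:
        cases.append(2)              # p leaves the register unchanged
    if v2 == a and rmw(after_p_prime, a, v1) == after_p:
        cases.append(3)              # p' leaves the register unchanged
    return cases


def uncovered(values):
    """All (a, v1, v2) over the given value domain that fall in no case."""
    return [(a, v1, v2) for a, v1, v2 in product(values, repeat=3)
            if not applicable_cases(a, v1, v2)]
```

Calling `uncovered((0, 1))` returns an empty list, while a three-valued domain such as `(0, 1, 2)` leaves exactly the triples with three distinct values uncovered. This only shows where the case analysis stops working; it does not by itself establish anything about non-binary protocols.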

In the proof of the next lemma, the notation $(\Diamond K)_{n-2}\beta$ from
the previous section is used.

Lemma 4 For any run $x$ and any two processes $p$ and $p'$
which are enabled at $x$,

if $\Diamond_Q(K_Q\beta_1 \lor K_Q\beta_2)$ at $x$ for every set of processes $Q$ where
$|Q| \ge n - 2$ and $A\{\beta_1,\beta_2\}$ at $x$ then there exists $y \ge x$ such
that $\neg x[\{p,p'\}]y$ and $A\{\beta_1,\beta_2\}$ at $y$.

Proof: Assume to the contrary that for some run $x$ and
two processes $p$ and $p'$ which are enabled at $x$, $\Diamond_Q(K_Q\beta_1 \lor
K_Q\beta_2)$ at $x$ for every set of processes $Q$ where $|Q| \ge n - 2$,
$A\{\beta_1,\beta_2\}$ at $x$, and there does not exist $y \ge x$ such that
$\neg x[\{p,p'\}]y$ and $A\{\beta_1,\beta_2\}$ at $y$. Observe that it follows
from the assumption that for any extension $m$ of $x$ where
$\neg(\beta_1 \lor \beta_2)$ at $m$ and both $p$ and $p'$ are enabled at $m$,
either $(\Diamond K)_{n-2}\beta_1$ at $o_p m$ or $(\Diamond K)_{n-2}\beta_2$ at $o_p m$, and either
$(\Diamond K)_{n-2}\beta_1$ at $o_{p'} m$ or $(\Diamond K)_{n-2}\beta_2$ at $o_{p'} m$. Since $\neg(\beta_1 \lor \beta_2)$
at $x$, assume w.l.o.g. that $(\Diamond K)_{n-2}\beta_1$ at $o_p x$. By Lemma 3,
also $(\Diamond K)_{n-2}\beta_1$ at $o_{p'} x$.

Since $A\{\beta_1,\beta_2\}$ at $x$, there exists an extension $z$ of $x$
($z \neq x$) such that $\Diamond\beta_2$ at $z$ and for any $x \le y < z$ it
is the case that $A\{\beta_1,\beta_2\}$ at $y$. It is important to notice
that $(\Diamond K)_{n-2}\beta_2$ at $z$. Furthermore, either $\beta_2$ at $z$ or both
$(\Diamond K)_{n-2}\beta_2$ at $o_p z$ (if $o_p z$ exists) and $(\Diamond K)_{n-2}\beta_2$ at $o_{p'} z$ (if
$o_{p'} z$ exists).

Let $z'$ be the longest prefix of $z$ such that $x[\{p,p'\}]z'$.
We notice that, by RMW2, for any $x \le y \le z'$ both $p$ and
$p'$ are enabled at $y$. Since $z' \le z$, it follows from the first
assumption and Lemma 3 that either $\beta_2$ at $z'$ (when $z' = z$)
or $(\Diamond K)_{n-2}\beta_2$ at $o_p z'$ and $(\Diamond K)_{n-2}\beta_2$ at $o_{p'} z'$. In any case,
$\Diamond\beta_2$ at both $o_p z'$ and $o_{p'} z'$.

Consider the extensions of $x$ which are also prefixes of $z'$.
Since $(\Diamond K)_{n-2}\beta_1$ at $o_p x$ there must exist extensions $y$ and
$y'$ (of $x$), where $y$ is a one-event extension of $y'$, such that
$(\Diamond K)_{n-2}\beta_1$ at both $o_p y'$ and $o_{p'} y'$, and either $\beta_2$ at $y$ (when
$y = z$) or $(\Diamond K)_{n-2}\beta_2$ at both $o_p y$ and $o_{p'} y$, and in either case,
$\Diamond\beta_2$ at both $o_p y$ and $o_{p'} y$. Let $y = (y'; e_{p''})$ for some event
$e_{p''}$, where $p'' \notin \{p,p'\}$.

For some registers $r_1$ and $r_2$, and values $v_1', v_1, v_2', v_2 \in
\{0,1\}$, $(o_p y' - y') = rmw_p(r_1, v_1', v_1)$, and $(y - y') =
rmw_{p''}(r_2, v_2', v_2)$.

Assume $r_1 \neq r_2$. By RMW1,
$w = (y'; rmw_p(r_1, v_1', v_1); rmw_{p''}(r_2, v_2', v_2))$ is a run. The
values of all shared registers are the same in $w$ and $o_p y$, and
$w[N]o_p y$. Let $Q = N - \{p''\}$. Because $w \ge o_p y'$ and $w[Q]o_p y'$,
$\Diamond_Q K_Q\beta_1$ at $w$. By Lemma 1, $\neg\Diamond\beta_2$ at $o_p y$, a contradiction.

Assume $r_1 = r_2$. Then $v_1' = v_2'$. There are three possible
cases. (1) $v_2' = v_2$. Since $(\Diamond K)_{n-2}\beta_1$ at $o_p y'$, $o_p y'[N - \{p''\}]o_p y$,
and the values of all shared registers are the same in $o_p y'$
and $o_p y$, by Lemma 1, $\neg\Diamond\beta_2$ at $o_p y$, a contradiction.

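(2) $v_1' = v_1$. Let $\mathcal{H} = N - \{p,p'\}$. Recall that either $\beta_2$ at
$y$ or $(\Diamond K)_{n-2}\beta_2$ at $o_{p'} y$. It follows that either $(\Diamond K)_{n-2}\beta_2$ at
$y$ or $(\Diamond K)_{n-2}\beta_2$ at $o_{p'} y$, and in either case $\Diamond_{\mathcal{H}} K_{\mathcal{H}}\beta_2$ at $o_{p'} y$.
Let $s = o_{p'} o_{p''} o_p y'$. Because $s \ge o_p y'$, $\Diamond\beta_1$ at $s$. Since
$o_{p'} y[\mathcal{H}]s$, and the values of all shared registers are the same
in $o_{p'} y$ and $s$, by Lemma 1, $\neg\Diamond\beta_1$ at $s$, a contradiction.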

(3) $v_1 = v_2$. Recall that either $\beta_2$ at $y$ or $(\Diamond K)_{n-2}\beta_2$ at
$o_{p'} y$. It follows that either $(\Diamond K)_{n-2}\beta_2$ at $y$ or $(\Diamond K)_{n-2}\beta_2$ at
$o_{p'} y$. Assume $(\Diamond K)_{n-2}\beta_2$ at $y$. Since $y[N - \{p,p''\}]o_p y'$,
and the values of all shared registers are the same in $y$ and
$o_p y'$, by Lemma 1, $\neg\Diamond\beta_1$ at $o_p y'$, a contradiction. Assume
$(\Diamond K)_{n-2}\beta_2$ at $o_{p'} y$. Let $t = o_{p'} o_p y'$. Because $t \ge o_p y'$, $\Diamond\beta_1$
at $t$. Since $o_{p'} y[N - \{p,p''\}]t$, and the values of all shared
registers are the same in $o_{p'} y$ and $t$, by Lemma 1, $\neg\Diamond\beta_1$ at
$t$, a contradiction. ∎
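
The register-value equalities invoked in the three same-register cases can be checked by brute force. The sketch below is illustrative only and rests on simplifying assumptions that are not in the paper: all three processes $p$, $p'$, $p''$ access one binary register, and a deterministic process in a fixed local state is modeled as a map from the value it reads to the value it writes. The names `RESPONSES`, `step` and `check_lemma4_cases` are hypothetical.

```python
from itertools import product

VALUES = (0, 1)

# Assumption: a deterministic process in a fixed local state is modeled as a
# map from the value it reads to the value it writes back.
RESPONSES = [dict(zip(VALUES, outs)) for outs in product(VALUES, repeat=len(VALUES))]


def step(value, response):
    """Register value after one rmw event by a process with this response map."""
    return response[value]


def check_lemma4_cases():
    """Enumerate every starting value and every behavior of p, p', p'' and
    assert the register equality used in whichever case applies.
    Returns the number of configurations checked."""
    checked = 0
    for a in VALUES:                                        # register value at y'
        for fp, fp1, fp2 in product(RESPONSES, repeat=3):   # p, p', p''
            v1, v2 = fp[a], fp2[a]     # values written by p and by p''
            y = step(a, fp2)           # y = (y'; e_{p''})
            op_y1 = step(a, fp)        # o_p y'
            if v2 == a:                # case (1): p'' leaves the register unchanged
                assert step(y, fp) == op_y1
            elif v1 == a:              # case (2): p leaves the register unchanged
                s = step(step(op_y1, fp2), fp1)   # s = o_{p'} o_{p''} o_p y'
                assert step(y, fp1) == s
            else:                      # case (3): binary values force v1 == v2
                assert v1 == v2
                assert y == op_y1
                assert step(y, fp1) == step(op_y1, fp1)   # o_{p'} y versus t
            checked += 1
    return checked
```

The enumeration covers 2 starting values and 4 response maps per process, so `check_lemma4_cases()` visits 128 configurations. Local states and the indistinguishability relations are not modeled, so this checks only the shared-register part of each case.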

Proof of Theorem 2: Assume to the contrary that for some
run $x$ where $A\{\beta_1,\beta_2\}$ at $x$, it is the case that $\Diamond_Q(K_Q\beta_1 \lor
K_Q\beta_2)$ at $x$ for every set of processes $Q$ where $|Q| = n - 2$.
Using Lemma 4 we can construct inductively, starting
from the run $x$, an $(n-1)$-fair run such that $A\{\beta_1,\beta_2\}$ holds
at all the finite prefixes of that run. Thus, $\neg\Diamond_Q(K_Q\beta_1 \lor
K_Q\beta_2)$ at $x$ for some set of processes $Q$ where $|Q| = n - 2$, a
contradiction. ∎